Abstract

Despite years of intensive research, Byzantine fault-tolerant (BFT) systems have not yet been
adopted in practice. This is due to the additional cost of BFT in terms of resources, protocol complexity
and performance, compared with crash fault-tolerance (CFT). This overhead of BFT comes
from the assumption of a powerful adversary that can fully control not only the Byzantine faulty
machines, but at the same time also the message delivery schedule across the entire network,
effectively inducing communication asynchrony and partitioning otherwise correct machines at will.
To many practitioners, however, such strong attacks appear irrelevant.

In this paper, we introduce cross fault tolerance or XFT, a novel approach to building reliable
and secure distributed systems and apply it to the classical state-machine replication (SMR)
problem. In short, an XFT SMR protocol provides the reliability guarantees of widely used asynchronous
CFT SMR protocols such as Paxos and Raft, but also tolerates Byzantine faults in
combination with network asynchrony, as long as a majority of replicas are correct and communicate
synchronously. This allows the development of XFT systems at the price of CFT (already
paid for in practice), yet with strictly stronger resilience than CFT, sometimes even stronger
than BFT itself.

As a showcase for XFT, we present XPaxos, the first XFT SMR protocol, and deploy it in a 
geo-replicated setting. Although it offers much stronger resilience than CFT SMR at no extra 
resource cost, the performance of XPaxos matches that of the state-of-the-art CFT protocols. 


1 Introduction 

Tolerance to any kind of service disruption, whether caused by a simple hardware fault or by a
large-scale disaster, is key for the survival of modern distributed systems. Cloud-scale applications must
be inherently resilient, as any outage has direct implications on the business behind them [24].

Modern production systems (e.g., [13, 8]) increase the number of nines of reliability^ by employing 
sophisticated distributed protocols that tolerate crash machine faults as well as network faults, such 
as network partitions or asynchrony, which reflect the inability of otherwise correct machines to 
communicate among each other in a timely manner. At the heart of these systems typically lies a 
crash fault-tolerant (CFT) consensus-based state-machine replication (SMR) primitive [35, 10]. 

These systems cannot deal with non-crash (or Byzantine [29]) faults, which include not only 
malicious, adversarial behavior, but also arise from errors in the hardware, stale or corrupted data 
from storage systems, memory errors caused by physical effects, bugs in software, hardware faults due 
to ever smaller circuits, and human mistakes that cause state corruptions and data loss. However, 
such problems do occur in practice: each of these faults has a public record of taking down major
production systems and corrupting their service [14, 4].

*Work done while being a PhD student at EURECOM. 

^As an illustration, five nines reliability means that a system is up and correctly running at least 99.999% of the 
time. In other words, malfunction is limited to one hour every 10 years on average. 

Despite more than 30 years of intensive research since the seminal work of Lamport, Shostak and
Pease [29], no practical answer to tolerating non-crash faults has emerged so far. In particular,
asynchronous Byzantine fault-tolerance (BFT), which promises to resolve this problem [9], has not lived
up to this expectation, largely because of its extra cost compared with CFT. Namely, asynchronous
(that is, “eventually synchronous” [18]) BFT SMR must use at least 3t + 1 replicas to tolerate t
non-crash faults [7] instead of only 2t + 1 replicas for CFT, as used by Paxos [27] or Raft [33], for
example.
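
To make the resource gap concrete, the following minimal sketch computes the replica counts quoted above. The function name `min_replicas` and the model labels are illustrative choices, not part of any protocol implementation.

```python
def min_replicas(t: int, model: str) -> int:
    """Minimum number of replicas needed to tolerate t faults.

    "CFT": asynchronous crash fault tolerance (e.g., Paxos, Raft), 2t + 1.
    "BFT": asynchronous Byzantine fault tolerance, 3t + 1.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    if model == "CFT":
        return 2 * t + 1
    if model == "BFT":
        return 3 * t + 1
    raise ValueError(f"unknown model: {model}")


# Print the replica counts side by side for small fault thresholds.
for t in (1, 2, 3):
    print(f"t={t}: CFT needs {min_replicas(t, 'CFT')}, BFT needs {min_replicas(t, 'BFT')}")
```

The extra replicas required by BFT grow linearly with t, which is the resource overhead that the rest of the paper sets out to avoid.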

The overhead of asynchronous BFT is due to the extraordinary power given to the adversary, 
which may control both the Byzantine faulty machines and the entire network in a coordinated way. 
In particular, the classical BFT adversary can partition any number of otherwise correct machines at 
will. In line with observations by practitioners [25], we claim that this adversary model is actually 
too strong for the phenomena observed in deployed systems. For instance, accidental non-crash faults 
usually do not lead to network partitions. Even malicious non-crash faults rarely cause the whole 
network to break down in wide-area networks and geo-replicated systems. The proverbial all-powerful 
attacker as a common source behind those faults is a popular and powerful simplification used for the 
design phase, but it has not seen equivalent proliferation in practice. 

In this paper, we introduce XFT (short for cross fault tolerance), a novel approach to building 
efficient resilient distributed systems that tolerate both non-crash (Byzantine) faults and network 
faults (asynchrony). In short, XFT allows building resilient systems that 

• do not use extra resources (replicas) compared with asynchronous CFT; 

• preserve all reliability guarantees of asynchronous CFT (that is, in the absence of Byzantine 
faults); and 

• provide correct service (i.e., safety and liveness [2]) even when Byzantine faults do occur, as long 
as a majority of the replicas are correct and can communicate with each other synchronously 
(that is, when a minority of the replicas are Byzantine-faulty or partitioned because of a network 
fault). 

In particular, we envision XFT for wide-area or geo-replicated systems [13], as well as for any other 
deployment where an adversary cannot easily coordinate enough network partitions and
Byzantine-faulty machine actions at the same time.

As a showcase for XFT, we present XPaxos, the first state-machine replication protocol in the XFT 
model. XPaxos tolerates faults beyond crashes in an efficient and practical way, achieving much greater 
coverage of realistic failure scenarios than the state-of-the-art CFT SMR protocols, such as Paxos or 
Raft. This comes without resource overhead as XPaxos uses 2t + 1 replicas. To validate the performance
of XPaxos, we deployed it in a geo-replicated setting across Amazon EC2 datacenters worldwide. In
particular, we integrated XPaxos within Apache ZooKeeper, a prominent and widely used coordination
service for cloud systems [19]. Our evaluation on EC2 shows that XPaxos performs almost as well in
terms of throughput and latency as a WAN-optimized variant of Paxos, and significantly better than
the best available BFT protocols. In our evaluation, XPaxos even outperforms the native CFT SMR
protocol built into ZooKeeper [20].

Finally, and perhaps surprisingly, we show that XFT can offer strictly stronger reliability guarantees
than state-of-the-art BFT, for instance under the assumption that machine faults and network
faults occur as independent and identically distributed random variables, for certain probabilities.
To this end, we calculate the number of nines of consistency (system safety) and availability (system
liveness) of resource-optimal CFT, BFT and XFT (e.g., XPaxos) protocols. Whereas XFT always provides
strictly stronger consistency and availability guarantees than CFT and always strictly stronger
availability guarantees than BFT, our reliability analysis shows that, in some cases, XFT also provides
strictly stronger consistency guarantees than BFT.

The remainder of this paper is organized as follows. In Section 2, we define the system model,
which is then followed by the definition of the XFT model in Section 3. In Section 4 and Section 5,
we present XPaxos and its evaluation in the geo-replicated context, respectively. Section 6 provides
a simplified reliability analysis comparing XFT with CFT and BFT. We overview related work and
conclude in Section 7. The full pseudocode and correctness proof of XPaxos are given in Appendices B
and C.

2 System model 

Machines. We consider a message-passing distributed system containing a set Π of n = |Π| machines,
also called replicas. Additionally, there is a separate set C of client machines.

Clients and replicas may suffer from Byzantine faults: we distinguish between crash faults, where 
a machine simply stops all computation and communication, and non-crash faults, where a machine 
acts arbitrarily, but cannot break cryptographic primitives we use (cryptographic hashes, MACs, 
message digests and digital signatures). A machine that is not faulty is called correct. We say a 
machine is benign if the machine is correct or crash-faulty. We further denote the number of replica 
faults at a given moment s by 

• t_c(s): the number of crash-faulty replicas, and

• t_nc(s): the number of non-crash-faulty replicas.

Network. Each pair of replicas is connected with reliable point-to-point bi-directional communication 
channels. In addition, each client can communicate with any replica. 

The system can be asynchronous in the sense that machines may not be able to exchange messages 
and obtain responses to their requests in time. In other words, network faults are possible; we define 
a network fault as the inability of some correct replicas to communicate with each other in a timely 
manner, that is, when a message exchanged between two correct replicas cannot be delivered and
processed within delay Δ, known to all replicas. Note that Δ is a deployment-specific parameter: we
discuss practical choices for Δ in the context of our geo-replicated setting in Section 5. Finally, we
assume an eventually synchronous system in which, eventually, network faults do not occur [18].

Note that we model an excessive processing delay as a network problem and not as an issue related 
to a machine fault. This choice is made consciously, rooted in the experience that for the general class 
of protocols considered in this work, a long local processing time is never an issue on correct machines 
compared with network delays. 

To help quantify the number of network faults, we first give the definition of partitioned replica. 

Definition 1 (Partitioned replica). Replica p is partitioned if p is not in the largest subset of replicas
in which every pair of replicas can communicate among each other within delay Δ.

If there is more than one subset with the maximum size, only one of them is recognized as the
largest subset. For example, in Figure 1, the number of partitioned replicas is 3, counting either the
group of p_1, p_4 and p_5 or that of p_2, p_3 and p_5. The number of partitioned replicas can be as much as
n - 1, which means that no two replicas can communicate with each other within delay Δ. We say
replica p is synchronous if p is not partitioned. We now quantify network faults at a given moment s
as


• t_p(s): the number of correct, but partitioned replicas.
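
Definition 1 can be illustrated with a small brute-force sketch. It assumes that timely connectivity is given as a set of unordered replica pairs (the hypothetical `timely` argument) and that ties between equally large subsets are broken by taking the first one found. All names are illustrative only, and the exhaustive search is suitable only for the small n typical of SMR deployments.

```python
from itertools import combinations


def largest_synchronous_subset(replicas, timely):
    """Return one largest subset in which every pair communicates within the delay bound.

    `timely` is a set of frozenset({p, q}) pairs that can exchange messages in time.
    """
    ordered = sorted(replicas)
    for size in range(len(ordered), 0, -1):
        for subset in combinations(ordered, size):
            if all(frozenset(pair) in timely for pair in combinations(subset, 2)):
                return set(subset)  # ties: the first subset found is taken as the largest
    return set()


def partitioned(replicas, timely):
    """Replicas outside the chosen largest synchronous subset (Definition 1)."""
    return set(replicas) - largest_synchronous_subset(replicas, timely)


def count_t_p(replicas, timely, correct):
    """Number of replicas that are correct but partitioned."""
    return len(partitioned(replicas, timely) & set(correct))


# Hypothetical 5-replica example: {1, 2, 3} are mutually timely, and so are {4, 5}.
replicas = {1, 2, 3, 4, 5}
timely = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (4, 5)]}
print(partitioned(replicas, timely))
print(count_t_p(replicas, timely, correct={1, 2, 3, 4}))
```

With no timely pairs at all, the largest subset is a single replica and n - 1 replicas count as partitioned, matching the bound stated above.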

Problem. In this paper, we focus on the deterministic state-machine replication problem (SMR) 
[35]. In short, in SMR clients invoke requests, which are then committed by replicas. SMR ensures 

• safety, or consistency, by (a) enforcing a total order on committed client requests across all
correct replicas; and by (b) enforcing validity, i.e., that a correct replica commits a request only
if it was previously invoked by a client;

• liveness, or availability, by eventually committing a request by a correct client at all correct 
replicas and returning an application-level reply to the client. 

Figure 1: An illustration of partitioned replicas: {p_1, p_4, p_5} or {p_2, p_3, p_5} are partitioned based on
Definition 1.


3 The XFT model 

This section introduces the XFT model and relates it to the established crash-fault tolerance (CFT) 
and Byzantine-fault tolerance (BFT) models. 

3.1 XFT in a nutshell 



                                              Maximum number of each type of replica faults
                                              non-crash faults    crash faults    partitioned replicas
Asynchronous CFT (e.g., Paxos [28])
    consistency                               0                   n               n - 1
    availability                              0                   ⌊(n-1)/2⌋ (combined)
Asynchronous BFT (e.g., PBFT [9])
    consistency                               ⌊(n-1)/3⌋           n               n - 1
    availability                              ⌊(n-1)/3⌋ (combined)
(Authenticated) Synchronous BFT (e.g., [29])
    consistency                               n - 1               n               0
    availability                              n - 1 (combined)                    0
XFT (e.g., XPaxos)
    consistency                               0                   n               n - 1
                                              ⌊(n-1)/2⌋ (combined)
    availability                              ⌊(n-1)/2⌋ (combined)

Table 1: The maximum numbers of each type of fault tolerated by representative SMR protocols.
Note that XFT provides consistency in two modes, depending on the occurrence of non-crash faults.

Classical CFT and BFT explicitly model machine faults only. These are then combined with an 
orthogonal network fault model, either the synchronous model (where network faults in our sense 
are ruled out), or the asynchronous model (which includes any number of network faults). Hence, 
previous work can be classified into four categories: synchronous CFT [16, 35], asynchronous CFT
[35, 27, 32], synchronous BFT [29, 17, 6], and asynchronous BFT [9, 3]. 

XFT, in contrast, redefines the boundaries between machine and network fault dimensions: XFT 
allows the design of reliable protocols that tolerate crash machine faults regardless of the number of 
network faults and that, at the same time, tolerate non-crash machine faults when the number of 
machines that are either faulty or partitioned is within a threshold. 

To formalize XFT, we first define anarchy, a very severe system condition with actual non-crash 
machine (replica) faults and plenty of faults of different kinds, as follows: 

Definition 2 (Anarchy). The system is in anarchy at a given moment s iff t_nc(s) > 0 and
t_c(s) + t_nc(s) + t_p(s) > t.

Here, t is the threshold of replica faults, such that t ≤ ⌊(n-1)/2⌋. In other words, in anarchy, some
replica is non-crash-faulty, and there is no correct and synchronous majority of replicas. Armed with
the definition of anarchy, we can define XFT protocols for an arbitrary distributed computing problem
as a function of its safety property [2].

Definition 3 (XFT protocol). Protocol P is an XFT protocol if P satisfies safety in all executions 
in which the system is never in anarchy. 
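
The anarchy condition can be read as a simple predicate over the three fault counters: at least one non-crash fault, and more than t replicas that are crash-faulty, non-crash-faulty or partitioned. The sketch below assumes the largest admissible threshold, t = ⌊(n-1)/2⌋; the function name `in_anarchy` is illustrative.

```python
def in_anarchy(n: int, t_nc: int, t_c: int, t_p: int) -> bool:
    """True iff some replica is non-crash-faulty and there is no correct,
    synchronous majority.

    Assumption: the threshold is taken at its maximum, t = floor((n - 1) / 2).
    """
    t = (n - 1) // 2
    return t_nc > 0 and (t_c + t_nc + t_p) > t


# n = 5, so t = 2.
print(in_anarchy(5, t_nc=1, t_c=0, t_p=1))  # one non-crash fault, one partitioned replica
print(in_anarchy(5, t_nc=1, t_c=1, t_p=1))  # three replicas faulty or partitioned
print(in_anarchy(5, t_nc=0, t_c=2, t_p=2))  # many faults, but none of them non-crash
```

An XFT protocol must preserve safety in every execution for which this predicate never holds.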

Liveness of an XFT protocol will typically depend on a problem and implementation. For instance, 
for deterministic SMR we consider in this paper, our XPaxos protocol eventually satisfies liveness, 
provided a majority of replicas is correct and synchronous. This can be shown optimal. 

3.2 XFT vs. CFT/BFT 

Table 1 illustrates differences between XFT and CFT/BFT in terms of their consistency and
availability guarantees for SMR.

State-of-the-art asynchronous CFT protocols [28, 33] guarantee consistency despite any number
of crash-faulty replicas and any number of partitioned replicas. They also guarantee availability
whenever a majority of replicas (t ≤ ⌊(n-1)/2⌋) are correct and synchronous. As soon as a single machine
is non-crash-faulty, CFT protocols guarantee neither consistency nor availability.

Optimal asynchronous BFT protocols [9, 22, 3] guarantee consistency despite any number of
crash-faulty or partitioned replicas, with at most t = ⌊(n-1)/3⌋ non-crash-faulty replicas. They also guarantee
availability with up to ⌊(n-1)/3⌋ combined faults, i.e., whenever more than two-thirds of replicas are
correct and not partitioned. Note that BFT availability might be weaker than that of CFT in the
absence of non-crash faults: unlike CFT, BFT does not guarantee availability when the sum of
crash-faulty and partitioned replicas is in the range [n/3, n/2).

Synchronous BFT protocols (e.g., [29]) do not consider the existence of correct, but partitioned
replicas. This is a very strong assumption, and it helps synchronous BFT protocols that use
digital signatures for message authentication (so-called authenticated protocols) to tolerate up to n - 1
non-crash-faulty replicas.

In contrast, XFT protocols with optimal resilience, such as our XPaxos, guarantee consistency in
two modes: (i) without non-crash faults, despite any number of crash-faulty and partitioned replicas
(i.e., just like CFT), and (ii) with non-crash faults, whenever a majority of replicas are correct and
not partitioned, i.e., provided the sum of all kinds of faults (machine or network faults) does not
exceed ⌊(n-1)/2⌋. Similarly, they also guarantee availability whenever a majority of replicas are correct
and not partitioned.

It may be tempting to view XFT as some sort of a combination of the asynchronous CFT and 
synchronous BFT models. However, this is misleading, as even with actual non-crash faults, XFT 
is incomparable to authenticated synchronous BFT. Specifically, authenticated synchronous BFT 
protocols, such as the seminal Byzantine Generals protocol [29], may violate consistency with a single 
partitioned replica. For instance, with n = 5 replicas and an execution in which three replicas are 
correct and synchronous, one replica is correct but partitioned and one replica is non-crash-faulty, the 
XFT model mandates that the consistency be preserved, whereas the Byzantine Generals protocol 
may violate consistency.^ 

Furthermore, from Table 1, it is evident that XFT offers strictly stronger guarantees than asynchronous
CFT, for both availability and consistency. XFT also offers strictly stronger availability
guarantees than asynchronous BFT. Finally, the consistency guarantees of XFT are incomparable to 
those of asynchronous BFT. On the one hand, outside anarchy, XFT is consistent with the number of 
non-crash faults in the range [n/3, n/2), whereas asynchronous BFT is not. On the other hand, unlike 
XFT, asynchronous BFT is consistent in anarchy provided the number of non-crash faults is less than 
n/3. We discuss these points further in Section 6, where we also quantify the reliability comparison 
between XFT and asynchronous CFT/BFT assuming the special case of independent faults. 
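
The comparison in this subsection can be summarized as predicates over the fault counters. The following sketch is one reading of the thresholds discussed above, with ⌊(n-1)/2⌋ for majorities and ⌊(n-1)/3⌋ for asynchronous BFT. The function names and model labels are assumptions made for illustration.

```python
def consistency_guaranteed(model: str, n: int, t_nc: int, t_c: int, t_p: int) -> bool:
    """Whether the model guarantees consistency for the given fault counts."""
    total = t_nc + t_c + t_p
    if model == "CFT":        # any number of crashes and partitions, no non-crash faults
        return t_nc == 0
    if model == "BFT":        # fewer than n/3 non-crash faults
        return t_nc <= (n - 1) // 3
    if model == "SYNC_BFT":   # authenticated synchronous BFT: no partitioned replicas
        return t_p == 0 and t_nc <= n - 1
    if model == "XFT":        # CFT mode, or a correct and synchronous majority
        return t_nc == 0 or total <= (n - 1) // 2
    raise ValueError(f"unknown model: {model}")


def availability_guaranteed(model: str, n: int, t_nc: int, t_c: int, t_p: int) -> bool:
    """Whether the model guarantees availability for the given fault counts."""
    total = t_nc + t_c + t_p
    if model == "CFT":
        return t_nc == 0 and total <= (n - 1) // 2
    if model == "BFT":
        return total <= (n - 1) // 3
    if model == "SYNC_BFT":
        return t_p == 0 and t_nc + t_c <= n - 1
    if model == "XFT":
        return total <= (n - 1) // 2
    raise ValueError(f"unknown model: {model}")


# The n = 5 execution from the text: one non-crash fault, one correct but partitioned replica.
for model in ("CFT", "BFT", "SYNC_BFT", "XFT"):
    print(model, consistency_guaranteed(model, 5, t_nc=1, t_c=0, t_p=1))
```

Under this reading, two non-crash faults with n = 5 are covered by XFT but not by asynchronous BFT, whereas one non-crash fault combined with two partitioned replicas (anarchy) is covered by asynchronous BFT but not by XFT, which is the incomparability noted above.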

^XFT is not stronger than authenticated synchronous BFT either, as the latter tolerates more machine faults in the 
complete absence of network faults. 

3.3 Where to use XFT?

The intuition behind XFT starts from the assumption that “extremely bad” system conditions, such 
as anarchy, are very rare, and that providing consistency guarantees in anarchy might not be worth 
paying the asynchronous BFT premium. 

In practice, this assumption is plausible in many deployments. We envision XFT for use cases in 
which an adversary cannot easily coordinate enough network partitions and non-crash-faulty machine 
actions at the same time. Some interesting candidate use cases include: 

• Tolerating “accidental” non-crash faults. In systems which are not susceptible to malicious
behavior and deliberate attacks, XFT can be used to protect against “accidental” non-crash
faults, which can be assumed to be largely independent of network faults. In such cases, XFT
could be used to harden CFT systems without the considerable overhead of BFT.

• Wide-area networks and geo-replicated systems. XFT may prove useful even in cases where the
system is susceptible to malicious non-crash faults, as long as it may be difficult or expensive 
for an adversary to coordinate an attack to compromise Byzantine machines and partition 
sufficiently many replicas at the same time. Particularly interesting for XFT are WAN and 
geo-replicated systems which often enjoy redundant communication paths and typically have a 
smaller surface for network-level DoS attacks (e.g., no multicast storms and flooding). 

• Blockchain. A special case of geo-replicated systems, interesting to XFT, are blockchain systems. 
In a typical blockchain system, such as Bitcoin [31], participants may be financially motivated 
to act maliciously, yet may lack the means and capabilities to compromise the communication 
among (a large number of) correct participants. In this context, XFT is particularly interesting 
for so-called permissioned blockchains, which are based on state-machine replication rather than 
on Bitcoin-style proof-of-work [39]. 

4 XPaxos Protocol 
4.1 XPaxos overview 

XPaxos is a novel state-machine replication (SMR) protocol designed specifically in the XFT model. 
XPaxos specifically targets good performance in geo-replicated settings, which are characterized by the 
network being the bottleneck, with high link latency and relatively low, heterogeneous link bandwidth. 

In a nutshell, XPaxos consists of three main components: 

• A common-case protocol, which replicates and totally orders requests across replicas. This
has, roughly speaking, the message pattern and complexity of communication among replicas of
state-of-the-art CFT protocols (e.g., Phase 2 of Paxos), hardened by the use of digital signatures.

• A novel view-change protocol, in which the information is transferred from one view (system 
configuration) to another in a decentralized, leaderless fashion. 

• A fault detection (FD) mechanism, which can help detect, outside anarchy, non-crash faults 
that would leave the system in an inconsistent state in anarchy. The goal of the FD mechanism 
is to minimize the impact of long-lived non-crash faults (in particular “data loss” faults) in the 
system and to help detect them before they coincide with a sufficient number of crash faults 
and network faults to push the system into anarchy. 

XPaxos is orchestrated in a sequence of views [9]. The central idea in XPaxos is that, during
common-case operation in a given view, XPaxos synchronously replicates clients’ requests to only
t + 1 replicas, which are the members of a synchronous group (out of n = 2t + 1 replicas in total).
Each view number i uniquely determines the synchronous group, sg_i, using a mapping known to all
replicas. Every synchronous group consists of one primary and t followers, which are jointly called
active replicas. The remaining t replicas in a given view are called passive replicas; optionally, passive
replicas learn the order from the active replicas using the lazy replication approach [26]. A view is
not changed unless there is a machine or network fault within the synchronous group.

Figure 2: XPaxos common-case message patterns (a) for the general case when t ≥ 2 and (b) for the
special case of t = 1. The synchronous groups are (s_0, s_1, s_2) and (s_0, s_1), respectively.

In the common case (Section 4.2), the clients send digitally signed requests to the primary, which 
are then replicated across t + 1 active replicas. These t + 1 replicas digitally sign and locally log 
the proofs for all replicated requests to their commit logs. Commit logs then serve as the basis for 
maintaining consistency in view changes. 

The view change of XPaxos (Section 4.3) reconfigures the entire synchronous group, not only the
leader. All t + 1 active replicas of the new synchronous group sg_{i+1} try to transfer the state from the
preceding views to view i + 1. This decentralized approach to view change stands in sharp contrast to
the classical reconfiguration/view-change in CFT and BFT protocols (e.g., [27, 9]), in which only a
single replica (the primary) leads the view change and transfers the state from previous views. This
difference is crucial to maintaining consistency (i.e., total order) across XPaxos views in the presence
of non-crash faults (but in the absence of full anarchy). This novel and decentralized view-change
scheme of XPaxos guarantees that even in the presence of non-crash faults, but outside anarchy, at
least one correct replica from the new synchronous group sg_{i+1} will be able to transfer the correct
state from previous views, as it will be able to contact some correct replica from any old synchronous
group.

Finally, the main idea behind the FD scheme of XPaxos is the following. In view change, a 
non-crash-faulty replica (of an old synchronous group) might not transfer its latest state to a correct 
replica in the new synchronous group. This “data loss” fault is dangerous, as it may violate consistency 
when the system is in anarchy. However, such a fault can be detected using digital signatures from 
the commit log of some correct replicas (from an old synchronous group), provided that these correct 
replicas can communicate synchronously with correct replicas from the new synchronous group. In a 
sense, with XPaxos FD, a critical non-crash machine fault must occur for the first time together with 
sufficiently many crash or partitioned machines (i.e., in anarchy) to violate consistency. 

In the following, we explain the core of XPaxos for the common case (Section 4.2), view-change 
(Section 4.3) and fault detection (Section 4.4) components. We discuss XPaxos optimizations in 
Section 4.5 and give XPaxos correctness arguments in Section 4.6. An example of XPaxos execution 
is given in Appendix A. The complete pseudocode and correctness proof are included in Appendix B 
and C. 

4.2 Common case 

Figure 2 shows the common-case message patterns of XPaxos for the general case (t ≥ 2) and for the
special case t = 1. XPaxos is specifically optimized for the case where t = 1, as in this case, there are
only two active replicas in each view and the protocol is very efficient. The special case t = 1 is also
highly relevant in practice (see, e.g., Spanner [13]). In the following, we first explain XPaxos in the
general case, and then focus on the t = 1 special case.

Notation. We denote the digest of a message m by D(m), whereas ⟨m⟩_{σ_p} denotes a message that
contains both D(m) signed by the private key of machine p and m. For signature verification, we
assume that all machines have public keys of all other processes.
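
As a rough illustration of this notation, the sketch below builds a signed message as the pair of m and a tag over D(m). To stay within the Python standard library it uses an HMAC with a shared key as a stand-in for a digital signature. This is an assumption of the sketch only: XPaxos relies on public-key signatures, which a MAC does not replace, and the helper names `digest`, `sign` and `verify` are hypothetical.

```python
import hashlib
import hmac


def digest(m: bytes) -> bytes:
    """D(m): a cryptographic digest of message m (SHA-256 chosen for illustration)."""
    return hashlib.sha256(m).digest()


def sign(m: bytes, key: bytes) -> tuple:
    """Return (m, tag), where tag authenticates D(m).

    Stand-in only: an HMAC with a shared key replaces the private-key signature.
    """
    return m, hmac.new(key, digest(m), hashlib.sha256).digest()


def verify(signed: tuple, key: bytes) -> bool:
    """Recompute the tag over D(m) and compare it in constant time."""
    m, tag = signed
    expected = hmac.new(key, digest(m), hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


key_p = b"demo key of machine p"
signed = sign(b"REPLICATE put(x, 1) ts=7 c", key_p)
print(verify(signed, key_p))
print(verify((b"tampered payload", signed[1]), key_p))
```

Authenticating the digest rather than the full message keeps the cost of the signing step independent of the request size.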


4.2.1 General case (t ≥ 2)

The common-case message pattern of XPaxos is shown in Figure 2a. More specifically, upon receiving
a signed request req = ⟨REPLICATE, op, ts_c, c⟩_{σ_c} from client c (where op is the client’s operation and
ts_c is the client’s timestamp), the primary (say s_0) (1) increments sequence number sn and assigns
sn to req, (2) signs a message prep = ⟨PREPARE, D(req), sn, i⟩_{σ_{s_0}} and logs ⟨req, prep⟩ into its prepare
log PrepareLog_{s_0}[sn] (we say s_0 prepares req), and (3) forwards ⟨req, prep⟩ to all other active replicas
(i.e., the t followers).

Each follower s_j (1 ≤ j ≤ t) verifies the primary’s and client’s signatures, checks whether its local
sequence number equals sn - 1, and logs ⟨req, prep⟩ into its prepare log PrepareLog_{s_j}[sn]. Then, s_j
updates its local sequence number to sn, signs the digest of the request req, the sequence number sn
and the view number i, and sends ⟨COMMIT, D(req), sn, i⟩_{σ_{s_j}} to all active replicas.

Upon receiving t signed COMMIT messages, one from each follower, such that a matching entry
is in the prepare log, an active replica s_k (0 ≤ k ≤ t) logs prep and the t signed COMMIT messages into
its commit log CommitLog_{s_k}[sn]. We say s_k commits req when this occurs. Finally, s_k executes req
and sends the authenticated reply to the client (followers may only send the digest of the reply). The
client commits the request when it receives matching reply messages from all t + 1 active replicas.

A client that times out without committing the request broadcasts the request to all active
replicas. Active replicas then forward such a request to the primary and trigger a retransmission
timer, within which a correct active replica expects the client’s request to be committed.
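
The general-case flow described above can be sketched as a small in-memory simulation. This is a simplified illustration under stated assumptions and not the XPaxos implementation: signatures, networking, request execution and client replies are omitted, messages are plain tuples, and the class and method names (`Replica`, `prepare`, `on_prepare`, `on_commit`) are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def D(req: bytes) -> str:
    """Digest of a request (SHA-256 chosen for illustration)."""
    return hashlib.sha256(req).hexdigest()


@dataclass
class Replica:
    rid: int
    sn: int = 0                                        # local sequence number
    prepare_log: dict = field(default_factory=dict)    # sn -> (req, prep)
    commit_log: dict = field(default_factory=dict)     # sn -> (prep, [commit messages])
    commits_seen: dict = field(default_factory=dict)   # sn -> {follower id: commit message}

    def prepare(self, req: bytes, view: int):
        """Primary: assign the next sequence number and log the PREPARE message."""
        self.sn += 1
        prep = ("PREPARE", D(req), self.sn, view)
        self.prepare_log[self.sn] = (req, prep)
        return prep

    def on_prepare(self, req: bytes, prep):
        """Follower: accept only the next sequence number with a matching digest."""
        _, d, sn, view = prep
        if d != D(req) or sn != self.sn + 1:
            return None
        self.prepare_log[sn] = (req, prep)
        self.sn = sn
        return ("COMMIT", d, sn, view, self.rid)

    def on_commit(self, commit, t: int) -> bool:
        """Active replica: commit once COMMITs from t distinct followers match the prepare log."""
        _, d, sn, view, sender = commit
        entry = self.prepare_log.get(sn)
        if entry is None or entry[1][1:] != (d, sn, view):
            return False
        self.commits_seen.setdefault(sn, {})[sender] = commit
        if len(self.commits_seen[sn]) == t and sn not in self.commit_log:
            self.commit_log[sn] = (entry[1], list(self.commits_seen[sn].values()))
            return True
        return False


# t = 2: primary s0 with followers s1 and s2 form the synchronous group.
t = 2
s0, s1, s2 = Replica(0), Replica(1), Replica(2)
req = b"op=put(x,1);ts=1;client=c"
prep = s0.prepare(req, view=0)
commits = [follower.on_prepare(req, prep) for follower in (s1, s2)]
committed = {r.rid: any([r.on_commit(c, t) for c in commits]) for r in (s0, s1, s2)}
print(committed)
```

Keying `commits_seen` by the sender ensures that a repeated COMMIT from the same follower is not counted twice towards the t required messages.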

4.2.2 Tolerating a single fault (t = 1)

When t = 1, the XPaxos common case simplifies to involving only 2 messages between 2 active replicas 
(see Figure 2b). 

Upon receiving a signed request req = ⟨REPLICATE, op, ts_c, c⟩_{σ_c} from client c, the primary (s_0)
increments the sequence number sn, signs sn along with the digest of req and view number i in message
m_0 = ⟨COMMIT, D(req), sn, i⟩_{σ_{s_0}}, stores ⟨req, m_0⟩ into its prepare log (PrepareLog_{s_0}[sn] = ⟨req, m_0⟩),
and sends the message ⟨req, m_0⟩ to the follower s_1.

On receiving ⟨req, m_0⟩, the follower s_1 verifies the client’s and primary’s signatures, and checks
whether its local sequence number equals sn - 1. If so, the follower updates its local sequence number
to sn, executes the request producing reply R(req), and signs message m_1; m_1 is similar to m_0, but
also includes the client’s timestamp and the digest of the reply: m_1 = ⟨COMMIT, ⟨D(req), sn, i⟩, req.ts_c,
D(R(req))⟩_{σ_{s_1}}. The follower then saves the tuple ⟨req, m_0, m_1⟩ to its commit log (CommitLog_{s_1}[sn] =
⟨req, m_0, m_1⟩) and sends m_1 to the primary.

The primary, on receiving a valid COMMIT message from the follower (with a matching entry in its
prepare log), executes the request, compares the reply R(req) with the follower’s digest contained in
m_1, and stores ⟨req, m_0, m_1⟩ in its commit log. Finally, it returns an authenticated reply containing
m_1 to c, which commits the request if all digests and the follower’s signature match.

4.3 View change 

Intuition. The ordered requests in commit logs of correct replicas are the key to enforcing consistency
(total order) in XPaxos. To illustrate an XPaxos view change, consider synchronous groups sg_i and
sg_{i+1} of views i and i + 1, respectively, each containing t + 1 replicas. Note that proofs of requests
committed in sg_i might have been logged by only one correct replica in sg_i. Nevertheless, the XPaxos
view change must ensure that (outside anarchy) these proofs are transferred to the new view i + 1.
To this end, we had to depart from traditional view change techniques [9, 22, 12] where the entire
view-change is led by a single replica, usually the primary of the new view. Instead, in XPaxos
view change, every active replica in sg_{i+1} retrieves information about requests committed in preceding
views. Intuitively, with a majority of correct and synchronous replicas, at least one correct and
synchronous replica from sg_{i+1} will contact (at least one) correct and synchronous replica from sg_i
and transfer the latest correct commit log to the new view i + 1.

Synchronous groups (i ∈ N_0)       sg_i    sg_{i+1}    sg_{i+2}
Active replicas     Primary        s_0     s_0         s_1
                    Follower       s_1     s_2         s_2
Passive replica                    s_2     s_1         s_0

Table 2: Synchronous group combinations (t = 1).



[Figure 3 legend: arrows denote messages sent with signature; 1 SUSPECT, 2 VIEW-CHANGE, 3 VC-FINAL, 4 NEW-VIEW.]

Figure 3: Illustration of XPaxos view change; the synchronous group is changed from (s_0, s_1) to (s_0, s_2).


In the following, we first describe how we choose active replicas for each view. Then, we explain 
how view changes are initiated, and, finally, how view changes are performed. 

4.3.1 Choosing active replicas 

To choose active replicas for view i, we may enumerate all sets containing t + 1 replicas (i.e.,
(2t+1 choose t+1) sets) which then alternate as synchronous groups across views in a round-robin fashion.
In addition, each synchronous group uniquely determines the primary. We assume that the mapping from view
numbers to synchronous groups is known to all replicas (see, e.g., Table 2).
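
A minimal sketch of such a mapping is shown below. It enumerates the groups in lexicographic order and, as an assumption of this sketch, designates the lowest-indexed member of each group as its primary; the text only requires that the group determine the primary uniquely. The function names are illustrative.

```python
from itertools import combinations


def synchronous_groups(t: int):
    """All (t + 1)-subsets of the n = 2t + 1 replicas, in lexicographic order."""
    n = 2 * t + 1
    return list(combinations(range(n), t + 1))


def view_config(view: int, t: int):
    """Map a view number to (primary, followers, passive replicas), round-robin."""
    groups = synchronous_groups(t)
    active = groups[view % len(groups)]
    primary = active[0]        # assumption: lowest index acts as primary
    followers = active[1:]
    passive = tuple(r for r in range(2 * t + 1) if r not in active)
    return primary, followers, passive


# For t = 1 the three groups rotate as (s0, s1), (s0, s2), (s1, s2).
for view in range(3):
    print(view, view_config(view, t=1))
```

For t = 1 this ordering happens to reproduce the rotation shown in Table 2. For t = 2 there are already 10 groups, which illustrates how quickly the enumeration grows with t.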

The above simple scheme works well for a small number of replicas (e.g., t = 1 and t = 2). For
a large number of replicas, the combinatorial number of synchronous groups may be inefficient. To
this end, XPaxos can be modified to rotate only the leader, which may then resort to deterministic
verifiable pseudorandom selection of the set of t followers in each view. The exact details of such a
scheme would, however, exceed the scope of this paper.

4.3.2 View-change initiation 

If a synchronous group in view i (denoted by sg_i) does not make progress, XPaxos performs a view
change. Only an active replica of sg_i may initiate a view change. An active replica s_j ∈ sg_i initiates
a view change if (i) s_j receives a message from another active replica that does not conform to the
protocol (e.g., an invalid signature), (ii) the retransmission timer at s_j expires, (iii) s_j does not
complete a view change to view i in a timely manner, or (iv) s_j receives a valid SUSPECT message
for view i from another replica in sg_i. Upon a view-change initiation, s_j stops participating in the
current view and sends ⟨SUSPECT, i, s_j⟩_{σ_{s_j}} to all other replicas.

4.3.3 Performing the view change 

Upon receiving a SUSPECT message from an active replica in view i (see the message pattern in
Figure 3), replica s_j stops processing messages of view i and sends m = ⟨VIEW-CHANGE, i + 1, s_j,
CommitLog_sj⟩_{σ_sj} to the t + 1 active replicas of sg_{i+1}. A VIEW-CHANGE message contains the commit
log CommitLog_sj of s_j. Commit logs might be empty (e.g., if s_j was passive).

Note that XPaxos requires all active replicas in the new view to collect the most recent state and
its proof (i.e., VIEW-CHANGE messages), rather than only the new primary. Otherwise, a faulty new
primary could, even outside anarchy, purposely omit VIEW-CHANGE messages that contain the most
recent state. Active replica s_j in view i + 1 waits for at least n − t VIEW-CHANGE messages from all
replicas, but also waits for 2Δ time, trying to collect as many messages as possible.

Upon completion of the above protocol, each active replica s_j ∈ sg_{i+1} inserts all VIEW-CHANGE
messages it has received into the set VCSet^{i+1}_sj. Then s_j sends ⟨VC-FINAL, i + 1, s_j, VCSet^{i+1}_sj⟩_{σ_sj}
to every active replica in view i + 1. This serves to exchange the received VIEW-CHANGE messages among
active replicas.

Every active replica s_j ∈ sg_{i+1} must receive VC-FINAL messages from all active replicas in sg_{i+1},
after which s_j extends the value VCSet^{i+1}_sj by combining the VCSet^{i+1} sets piggybacked in VC-FINAL
messages. Then, for each sequence number sn, an active replica selects the commit log with the
highest view number in all VIEW-CHANGE messages, to confirm the committed request at sn.

Afterwards, to prepare and commit the selected requests in view i + 1, the new primary ps_{i+1} sends
⟨NEW-VIEW, i + 1, PrepareLog⟩_{σ_ps_{i+1}} to every active replica in sg_{i+1}, where the array PrepareLog
contains the prepare logs generated in view i + 1 for each selected request. Upon receiving a NEW-VIEW
message, every active replica s_j ∈ sg_{i+1} processes the prepare logs in PrepareLog as described in the
common case (see Section 4.2).

Finally, every active replica s_j ∈ sg_{i+1} makes sure that all selected requests in PrepareLog are
committed in view i + 1. When this condition is satisfied, XPaxos can start processing new requests.
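
The selection step of the view change (for each sequence number, keep the commit log entry with the highest view number) can be sketched as below. This is only an illustration under stated assumptions: each transferred commit log is modeled as a dictionary from sequence number to a (view, request) pair, the function name is hypothetical, and signature verification of the VIEW-CHANGE messages is omitted.

```python
def select_committed_requests(vc_set):
    """Merge commit logs from VIEW-CHANGE messages.

    vc_set: iterable of commit logs, each a dict {sn: (view, request)}
            (an assumed layout, not the paper's wire format).
    Returns {sn: (view, request)} holding, per sequence number, the entry
    with the highest view number seen in any commit log.
    """
    selected = {}
    for commit_log in vc_set:
        for sn, (view, request) in commit_log.items():
            if sn not in selected or view > selected[sn][0]:
                selected[sn] = (view, request)
    return selected


logs = [
    {1: (0, "a"), 2: (0, "b")},   # log from a replica active in view 0
    {2: (1, "b2"), 3: (1, "c")},  # log from a replica active in view 1
    {},                           # a previously passive replica
]
print(select_committed_requests(logs))
```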

4.4 Fault detection 

XPaxos does not guarantee consistency in anarchy. Hence, non-crash faults could violate XPaxos 
consistency in the long run, if they persist long enough to eventually coincide with enough crash or 
network faults. To cope with long-lived faults, we propose (an otherwise optional) Fault Detection 
(FD) mechanism for XPaxos. 

Roughly speaking, FD guarantees the following property: if a machine p suffers a non-crash fault 
outside anarchy in a way that would cause inconsistency in anarchy, then XPaxos FD detects p as 
faulty (outside anarchy). In other words, any potentially fatal fault that occurs outside anarchy would 
be detected by XPaxos FD. 

Here, we sketch how FD works in the case t = 1 (see Section B.4 for details), focusing on detecting 
a specific non-crash fault that may render XPaxos inconsistent in anarchy — a data loss fault by which 
a non-crash-faulty replica loses some of its commit log prior to a view change. Intuitively, data loss 
faults are dangerous as they cannot be prevented by the straightforward use of digital signatures. 

Our FD mechanism entails modifying the XPaxos view change as follows: in addition to exchanging 
their commit logs, replicas also exchange their prepare logs. Notice that in the case t = 1 only the 
primary maintains a prepare log (see Section 4.2). In the new view, the primary prepares and the 
follower commits all requests contained in transferred commit and prepare logs. 

With the above modification, to violate consistency, a faulty primary (of preceding view i) would 
need to exhibit a data loss fault in both its commit log and its prepare log. However, such a data 
loss fault in the primary’s prepare log would be detected, outside anarchy, because (i) the (correct) 
follower of view i would reply in the view change and (ii) an entry in the primary’s prepare log causally 
precedes the respective entry in the follower’s commit log. By simply verifying the signatures in the 
follower’s commit log, the fault of a primary is detected. Conversely, a data loss fault in the commit 
log of the follower of view i is detected outside anarchy by verifying the signatures in the commit log 
of the primary of view i. 

4.5 XPaxos optimizations 

Although the common-case and view-change protocols described above are sufficient to guarantee
correctness, we applied several standard performance optimizations to XPaxos. These include
checkpointing and lazy replication [26] to passive replicas (to help shorten the state transfer during view
change) as well as batching and pipelining (to improve the throughput).

4.5.1 Checkpointing 

When an active replica s_j ∈ sg_i commits and executes the request with sequence number sn = k × CHK
(see the message pattern in Fig. 4), s_j sends ⟨PRECHK, sn, i, D(st^sn_sj), s_j⟩ to every active replica
s_k, where D(st^sn_sj) is the digest of the state after executing the request at sn. Upon receiving t + 1
matching PRECHK messages, each active replica s_j generates the checkpoint proof message m and
sends it to every active replica (m = ⟨CHKPT, sn, i, s_j⟩_{σ_sj}). Upon receiving t + 1 matching
CHKPT messages, each active replica s_j checkpoints the state and discards previous prepare logs and
commit logs.

In addition, each active replica propagates checkpoint proofs to all passive replicas by ⟨LAZYCHK, chkProof⟩,
where chkProof contains t + 1 CHKPT messages.



Figure 4: XPaxos checkpointing message pattern; the synchronous group is (s_0, s_1).
Message steps: 1 PRECHK, 2 CHKPT, 3 LAZYCHK; arrows are drawn with signature or with MAC.


4.5.2 Lazy replication 

To speed up the state transfer in view change, the followers in the synchronous group lazily propagate
the commit log to every passive replica. With lazy replication, a new active replica, which might
have been a passive replica in the preceding view, needs to retrieve only the missing state from others.

More specifically (see the message pattern in Fig. 5), in the case t = 1, upon committing request req,
the follower sends the commit log of req to the passive replica. In the case t ≥ 2, either each of the t
followers sends the commit log of req to one passive replica, or each follower sends a fraction (1/t) of
the commit logs to every passive replica. The primary is involved in lazy replication only if the
bandwidth between followers and passive replicas is saturated. Each passive replica commits and executes
requests in the order defined by the commit logs.

Although non-crash faulty replicas can interfere with the lazy replication scheme, this would not 
impact the correctness of the protocol, but only slow down the view-change. 


Figure 5: XPaxos common-case message patterns with lazy replication for t = 1 and t ≥ 2 (here t = 2).
The synchronous groups illustrated are (s_0, s_1) (when t = 1) and (s_0, s_1, s_2) (when t = 2), respectively.
Message steps: 1 REPLICATE, 2 PREPARE, 3 COMMIT, 4 REPLY; arrows are drawn with signature, with MAC,
or dotted for lazy replication.

Batching and pipelining. To improve the throughput of cryptographic operations, the primary
batches several requests when preparing. The primary waits for B requests, then signs the batched
request and sends it to every follower. If the primary receives fewer than B requests within a time limit,
it batches all the requests it has received.
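
The batching rule can be sketched as follows. This is a simplified illustration rather than the XPaxos code: the class and method names are hypothetical, the clock is passed in explicitly so that the behavior is deterministic, the time limit is assumed to be measured from the arrival of the first request in the batch, and the signing and sending of a flushed batch are left out.

```python
class Batcher:
    """Collects requests and releases a batch after B requests or a time limit."""

    def __init__(self, batch_size, time_limit):
        self.batch_size = batch_size    # B in the text
        self.time_limit = time_limit    # seconds; assumed to start at first arrival
        self.pending = []
        self.first_arrival = None

    def add(self, request, now):
        """Queue a request; return a full batch once B requests are pending."""
        if not self.pending:
            self.first_arrival = now
        self.pending.append(request)
        if len(self.pending) >= self.batch_size:
            return self._flush()
        return None

    def tick(self, now):
        """Called periodically; releases a partial batch after the time limit."""
        if self.pending and now - self.first_arrival >= self.time_limit:
            return self._flush()
        return None

    def _flush(self):
        batch, self.pending = self.pending, []
        self.first_arrival = None
        return batch  # the primary would sign this batch once and send it


batcher = Batcher(batch_size=3, time_limit=1.0)
print(batcher.add("r1", 0.0), batcher.add("r2", 0.1), batcher.add("r3", 0.2))
```

Signing one batch instead of B individual requests is what amortizes the cost of the signature.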




4.6 Correctness arguments 

Consistency (Total Order). XPaxos enforces the following invariant, which is key to total order. 

Lemma 1. Outside anarchy, if a benign client c commits a request req with sequence number sn in
view i, and a benign replica s_k commits the request req' with sn in view i' > i, then req = req'.

A benign client c commits request req with sequence number sn in view i only after c has received
matching replies from t + 1 active replicas in sg_i. This implies that every benign replica in sg_i stores
req into its commit log under sequence number sn. In the following, we focus on the special case
where i' = i + 1. This serves as the base step for the proof of Lemma 1 by induction across views,
which we give in Section C.

Recall that, in view i' = i + 1, all (benign) replicas from sg_{i+1} wait for n − t = t + 1 VIEW-CHANGE
messages containing commit logs transferred from other replicas, as well as for the timer set to 2Δ
to expire. Then, replicas in sg_{i+1} exchange this information within VC-FINAL messages. Note that,
outside anarchy, there exists at least one correct and synchronous replica in sg_{i+1}, say s_j. Hence, a
benign replica s_k that commits req' in view i + 1 under sequence number sn must have received
VC-FINAL from s_j. In turn, s_j waited for t + 1 VIEW-CHANGE messages (and timer 2Δ), so it received
a VIEW-CHANGE message from some correct and synchronous replica s_x ∈ sg_i (such a replica exists in
sg_i as at most t replicas in sg_i are non-crash-faulty or partitioned). As s_x stored req under sn in its
commit log in view i, it forwards this information to s_j in a VIEW-CHANGE message, and s_j forwards
this information to s_k within a VC-FINAL. Hence req = req' follows.

Availability. XPaxos availability is guaranteed if the synchronous group contains only correct and
synchronous replicas. With eventual synchrony, we can assume that, eventually, there will be no
network faults. In addition, with all combinations of t + 1 replicas rotating in the role of active
replicas, XPaxos guarantees that, eventually, view change in XPaxos will complete with t + 1 correct
and synchronous active replicas.

5 Performance Evaluation 



                US West 1 (CA)          Europe (EU)             Tokyo (JP)              Sydney (AU)             Sao Paulo (BR)
US East (VA)    88/1097/82190/166390    92/1112/85649/169749    179/1226/81177/165277   268/1372/95074/179174   146/1214/85434/169534
US West 1 (CA)  -                       174/1184/1974/15467     120/1133/1180/6210      186/1209/6354/51646     207/1252/90980/169080
Europe (EU)     -                       -                       287/1310/1397/4798      342/1375/3154/11052     233/1257/1382/9188
Tokyo (JP)      -                       -                       -                       137/1149/1414/5228      394/2496/11399/94775
Sydney (AU)     -                       -                       -                       -                       392/1496/2134/10983

Table 3: Round-trip latency of TCP ping (hping3) across Amazon EC2 datacenters, collected during
three months. The latencies are given in milliseconds, in the format: average / 99.99% / 99.999% /
maximum.


In this section, we evaluate the performance of XPaxos and compare it to that of Zyzzyva [22],
PBFT [9] and a WAN-optimized version of Paxos [27], using the Amazon EC2 worldwide cloud
platform. We chose geo-replicated, WAN settings as we believe that these are a better fit for protocols
that tolerate Byzantine faults, including XFT and BFT. Indeed, in WAN settings (i) there is no single
point of failure such as a switch interconnecting machines, (ii) there are no correlated failures due to,
e.g., a power outage, a storm, or other natural disasters, and (iii) it is difficult for the adversary to
flood the network, correlating network and non-crash faults (the last point is relevant for XFT).

In the remainder of this section, we first present the experimental setup (Section 5.1), and then 
evaluate the performance (throughput, latency and CPU cost) in the fault-free scenario (Section 5.2 
and Section 5.3) as well as under faults (Section 5.4). Finally, we perform a performance comparison 
using a real application, the ZooKeeper coordination service [19] (Section 5.5), by comparing native 
ZooKeeper to ZooKeeper variants that use the four replication protocols mentioned above. 

5.1 Experimental setup 

5.1.1 Synchrony and XPaxos 

In a practical deployment of XPaxos, a critical parameter is the value of timeout Δ, i.e., the upper
bound on the communication delay between any two correct machines. If the round-trip time (RTT)
between two correct machines takes more than 2Δ, we declare a network fault (see Section 2). Notably,
Δ is vital to the XPaxos view-change (Section 4.3).

To understand the value of Δ in our geo-replicated context, we ran a 3-month experiment during
which we continuously measured the round-trip latency across six Amazon EC2 datacenters worldwide
using TCP ping (hping3). We used the least expensive EC2 micro instances, which arguably have the
highest probability of experiencing variable latency due to virtualization. Each instance was pinging
all other instances every 100 ms. The results of this experiment are summarized in Table 3. While
we detected network faults lasting up to 3 min, our experiment showed that the round-trip latency
between any two datacenters was less than 2.5 sec 99.99% of the time. Therefore, we adopted the
value of Δ = 2.5/2 = 1.25 sec.

5.1.2 Protocols under test 

We compare XPaxos with three protocols whose common-case message patterns when t = 1 are shown
in Figure 6. The first two are BFT protocols, namely (a speculative variant of) PBFT [9] and Zyzzyva
[22], and require 3t + 1 replicas to tolerate t faults. We chose PBFT because it is possible to derive
a speculative variant of the protocol that relies on a 2-phase common-case commit protocol across
only 2t + 1 replicas (Figure 6a; see also [9]). In this PBFT variant, the remaining t replicas are not
involved in the common case, which is more efficient in a geo-replicated setting. We chose Zyzzyva
because it is the fastest BFT protocol that involves all replicas in the common case (Figure 6b). The
third protocol we compare against is a very efficient WAN-optimized variant of crash-tolerant Paxos
inspired by [5, 23, 13]. We have chosen the variant of Paxos that exhibits the fastest write pattern
(Figure 6c). This variant requires 2t + 1 replicas to tolerate t faults, but involves t + 1 replicas in the
common case, i.e., just like XPaxos.

To provide a fair comparison, all protocols rely on the same Java code base and use batching,
with the batch size set to 20. We rely on HMAC-SHA1 to compute MACs and RSA1024 to sign
and verify signatures, computed using the Crypto++ [1] library, which we interface with the various
protocols using JNI.



Figure 6: Communication patterns of the three protocols under test (t = 1).


5.1.3 Experimental testbed and benchmarks 

We run the experiments on the Amazon EC2 platform which comprises widely distributed datacenters, 
interconnected by the Internet. Communications between datacenters have a low bandwidth and a 
high latency. We run the experiments on mid-range virtual machines that contain 8 vCPUs, 15 
GB of memory, 2 x 80 GB SSD storage, and run Ubuntu Server 14.04 LTS (PV) with the Linux 
3.13.0-24-generic x86_64 kernel. 

In the case t = 1, Table 4 gives the deployment of the different replicas at different datacenters, for
each protocol analyzed. Clients are always located in the same datacenter as the (initial) primary to
better emulate what is done in modern geo-replicated systems where clients are served by the closest
datacenter [36, 13].^1

To stress the protocols, we run a microbenchmark that is similar to the one used in [9, 22]. In
this microbenchmark, each server replicates a null service (this means that there is no execution of
requests). Moreover, clients issue requests in closed-loop: a client waits for a reply to its current
request before issuing a new request. The benchmark allows both the request size and the reply size
to be varied. Due to space limitations, we only report results for two request sizes (1kB, 4kB) and one
reply size (0kB). We refer to these microbenchmarks as 1/0 and 4/0 benchmarks, respectively.

^1 In practice, modern geo-replicated systems, like Spanner [13], use hundreds of CFT SMR instances across different
partitions to accommodate geo-distributed clients.

5.2 Fault-free performance 

We first compare the performance of protocols when t = 1 in replica configurations as shown in 
Table 4, using the 1/0 and 4/0 microbenchmarks. The results are shown in Figures 7a and 7b. In 
each graph, the X-axis shows the throughput (in kops/sec), and Y-axis the latency (in ms). 


PBFT       Zyzzyva    Paxos      XPaxos     EC2 Region
Primary    Primary    Primary    Primary    US West (CA)
Active     Active     Active     Follower   US East (VA)
Active     Active     Passive    Passive    Tokyo (JP)
Passive    Active     -          -          Europe (EU)

Table 4: Configurations of replicas. Shaded replicas are not used in the common case.



Figure 7: Fault-free performance. Panels: (a) 1/0 benchmark, t = 1; (b) 4/0 benchmark, t = 1;
(c) 1/0 benchmark, t = 2. X-axis: throughput (kops/sec).

As we can see, in both benchmarks, XPaxos achieves a significantly better performance than PBFT 
and Zyzzyva. This is because, in a worldwide cloud environment, the network is the bottleneck and the 
message patterns of BFT protocols, namely PBFT and Zyzzyva, tend to be expensive. Compared with 
PBFT, the simpler message pattern of XPaxos allows better throughput. Compared with Zyzzyva, 
XPaxos puts less stress on the leader and replicates requests in the common case across 3 times 
fewer replicas than Zyzzyva (i.e., across t followers vs. across all other 3t replicas). Moreover, the 
performance of XPaxos is very close to that of Paxos. Both Paxos and XPaxos implement a round-trip 
across two replicas when t = 1, which renders them very efficient. 

Next, to assess the fault scalability of XPaxos, we ran the 1/0 micro-benchmark in configurations
that tolerate two faults (t = 2). We use the following EC2 datacenters for this experiment: CA
(California), OR (Oregon), VA (Virginia), JP (Tokyo), EU (Ireland), AU (Sydney) and SG (Singapore).
We place Paxos and XPaxos active replicas in the first t + 1 datacenters, and their passive replicas in
the next t datacenters. PBFT uses the first 2t + 1 datacenters for active replicas and the last t for
passive replicas. Finally, Zyzzyva uses all replicas as active replicas.

We observe that XPaxos again clearly outperforms PBFT and Zyzzyva and achieves a performance 
very close to that of Paxos. Moreover, unlike PBFT and Zyzzyva, Paxos and XPaxos only suffer a 
moderate performance decrease with respect to the t = 1 case. 

5.3 CPU cost 

To assess the cost of using signatures in XPaxos, we extracted the CPU usage during the experiments
presented in Section 5.2 with the 1/0 and 4/0 micro-benchmarks when t = 1. During the experiments, we
periodically sampled CPU usage at the most loaded node (the primary in every protocol) with the
top Linux monitoring tool. The results are depicted in Figure 8 for both the 1/0 and 4/0
micro-benchmarks. The X-axis represents the peak throughput (in kops/s), whereas the Y-axis represents
the CPU usage (in %). Not surprisingly, we observe that the CPU usage of all protocols is higher
with the 1/0 benchmark than with the 4/0 benchmark. This comes from the fact that in the former
case, there are more messages to handle per time unit. We also observe that the CPU usage of
XPaxos is higher than that of other protocols, due to the use of digital signatures. Nevertheless, this
cost remains very reasonable: never more than half of the eight cores available on the experimental
machines were used. Note that this cost could probably be significantly reduced by using GPUs, as
recently proposed on the EC2 platform. Moreover, compared to BFT protocols (PBFT and Zyzzyva),
while CPU usage of XPaxos is higher, XPaxos also sustains a significantly higher throughput.

Figure 8: CPU usage when running the 1/0 and 4/0 micro-benchmarks.


5.4 Performance under faults 

In this section, we analyze the behavior of XPaxos under faults. We run the 1/0 micro-benchmark on
three replicas (CA, VA, JP) to tolerate one fault (see also Table 4). The experiment starts with CA
and VA as active replicas, and with 2500 clients in CA. At time 180 sec, we crash the follower, VA. At
time 300 sec, we crash the CA replica. At time 420 sec, we crash the third replica, JP. Each replica
recovers 20 sec after having crashed. Moreover, the timeout 2Δ (used during state transfer in view
change, Section 4.3) is set to 2.5 sec (see Section 5.1.1). We show the throughput of XPaxos as a function
of time in Figure 9, which also indicates the active replicas for each view. We observe that after each
crash, the system performs a view change that lasts less than 10 sec, which is very reasonable in a
geo-distributed setting. This fast execution of the view-change subprotocol is a consequence of lazy
replication in XPaxos that keeps passive replicas updated. We also observe that the throughput of
XPaxos changes with the views. This is because the latencies between the primary and the follower
and between the primary and clients vary from view to view.



Figure 9: XPaxos under faults.


5.5 Macro-benchmark: ZooKeeper 

To assess the impact of our work on real-life applications, we measured the performance achieved
when replicating the ZooKeeper coordination service [19] using all protocols considered in this study:
Zyzzyva, PBFT, Paxos and XPaxos. We also compare with the native ZooKeeper performance, when
the system is replicated using the built-in Zab protocol [20]. This protocol is crash-resilient and
requires 2t + 1 replicas to tolerate t faults.

We used the ZooKeeper 3.4.6 codebase. The integration of the various protocols inside ZooKeeper
was carried out by replacing the Zab protocol. For fair comparison to native ZooKeeper, we made
a minor modification to native ZooKeeper to force it to use (and keep) a given node as primary. To
focus the comparison on the performance of replication protocols, and avoid hitting other system
bottlenecks (such as storage I/O that is not very efficient in virtualized cloud environments), we store
ZooKeeper data and log directories on a volatile tmpfs file system. The configuration tested tolerates
one fault (t = 1). ZooKeeper clients were located in the same region as the primary (CA). Each client
invokes 1 kB write operations in a closed loop.

Figure 10 depicts the results. The X-axis represents the throughput in kops/sec. The Y-axis 
represents the latency in ms. In this macro-benchmark, we find that Paxos and XPaxos clearly 
outperform BFT protocols and that XPaxos achieves a performance close to that of Paxos. More 
surprisingly, we can see that XPaxos is more efficient than the built-in Zab protocol, although the 
latter only tolerates crash faults. For both protocols, the bottleneck in the WAN setting is the 
bandwidth at the leader, but the leader in Zab sends requests to all other 2t replicas whereas the 
XPaxos leader sends requests only to t followers, which yields a higher peak throughput for XPaxos.



Figure 10: Latency vs. throughput for the ZooKeeper application (t = 1). 


6 Reliability Analysis 

In this section, we illustrate the reliability guarantees of XPaxos by analytically comparing them with 
those of the state-of-the-art asynchronous CFT and BFT protocols. For simplicity of the analysis, 
we consider the fault states of the machines to be independent and identically distributed random 
variables. 

We denote the probability that a replica is correct (resp., crash faulty) by p_correct (resp., p_crash).
The probability that a replica is benign is p_benign = p_correct + p_crash. Hence, a replica is non-crash
faulty with probability p_non-crash = 1 − p_benign.

Besides, we assume there is a probability p_synchrony that a replica is synchronous, where p_synchrony
is a function of Δ, the network, and the system environment. Therefore, the probability that a replica
is partitioned equals 1 − p_synchrony.

Based on the assumption that network faults and machine faults occur independently, it is
straightforward to reason that, for a given machine, p_benign and p_correct are independent of p_synchrony.
Hence, the probability that a machine is available (i.e., correct and synchronous) is
p_available = p_correct × p_synchrony.

In line with industry practice, we measure reliability guarantees and coverage of fault scenarios
using nines of reliability. Specifically, we distinguish nines of consistency and nines of availability
and use these measures to compare different fault models. We introduce a function 9of(p) that turns
a probability p into the corresponding number of “nines”, by letting 9of(p) = ⌊−log10(1 − p)⌋. For
example, 9of(0.999) = 3. For brevity, 9_benign stands for 9of(p_benign), and so on, for other probabilities
of interest. Beyond the analysis and examples that follow, Appendix D contains additional examples
of practical values of nines of reliability achieved by XFT, CFT and BFT protocols.

6.1 Consistency

We start with the number of nines of consistency for an asynchronous CFT protocol, denoted
by 9ofC(CFT) = 9of(P[CFT is consistent]). As P[CFT is consistent] = p_benign^n, a straightforward
calculation yields:

\[
9ofC(\mathrm{CFT}) = \left\lfloor -\log_{10}(1 - p_{benign}) - \log_{10}\Big(\sum_{i=0}^{n-1} p_{benign}^{\,i}\Big) \right\rfloor,
\]

which gives 9ofC(CFT) ≈ 9_benign − ⌈log10(n)⌉ for values of p_benign close to 1, when p_benign^i decreases
slowly. As a rule of thumb, for small values of n, i.e., n < 10, we have 9ofC(CFT) ≈ 9_benign − 1.
In other words, in typical configurations, where few faults are tolerated [13], a CFT system as a
whole loses one nine of consistency from the likelihood that a single replica is benign.
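
The function 9of and the CFT rule of thumb can be checked with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is an implementation choice made here because a floating-point evaluation of ⌊−log10(1 − 0.999)⌋ can round down to 2 instead of 3.

```python
from fractions import Fraction


def nines(p):
    """9of(p) = floor(-log10(1 - p)), computed exactly on a Fraction."""
    q = 1 - Fraction(p)
    if q <= 0:
        raise ValueError("p must be strictly less than 1")
    count = 0
    while q * 10 <= 1:  # q <= 10^-(count+1) means one more nine
        q *= 10
        count += 1
    return count


def nines_of_consistency_cft(p_benign, n):
    """9ofC(CFT): a CFT protocol is consistent iff all n replicas are benign."""
    return nines(Fraction(p_benign) ** n)


print(nines("0.999"))                         # 3
print(nines_of_consistency_cft("0.9999", 3))  # 3, i.e., 9_benign - 1
```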


6.1.1 XPaxos vs. CFT 

We now quantify the advantage of XPaxos over asynchronous CFT. From Table 1, if there is no
non-crash fault, or there are no more than t faults (machine faults or network faults), XPaxos is consistent,
i.e.,

\[
P[\text{XPaxos is consistent}] = p_{benign}^{\,n}
  + \sum_{i=1}^{t=\lfloor\frac{n-1}{2}\rfloor} \binom{n}{i}\, p_{non\text{-}crash}^{\,i}
  \times \sum_{j=0}^{t-i} \binom{n-i}{j}\, p_{crash}^{\,j} \times p_{correct}^{\,n-i-j}
  \times \sum_{k=0}^{t-i-j} \binom{n-i-j}{k}\, p_{synchrony}^{\,n-i-j-k} \times (1 - p_{synchrony})^{k}.
\]

To quantify the difference between XPaxos and CFT more tangibly, we calculated 9ofC(XPaxos)
and 9ofC(CFT) for all values of 9_benign, 9_correct and 9_synchrony (9_benign ≥ 9_correct) between 1 and 20 in
the special cases where t = 1 and t = 2, which are most relevant in practice. For t = 1, we observed
the following relation:

\[
9ofC(\mathrm{XPaxos}_{t=1}) - 9ofC(\mathrm{CFT}_{t=1}) =
\begin{cases}
9_{correct} - 1, & 9_{benign} > 9_{synchrony} \text{ and } 9_{synchrony} = 9_{correct},\\
\min(9_{synchrony},\, 9_{correct}), & \text{otherwise.}
\end{cases}
\]


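Similarly, for t = 2:

\[
9ofC(\mathrm{XPaxos}_{t=2}) - 9ofC(\mathrm{CFT}_{t=2}) =
\begin{cases}
2 \times 9_{correct} - 2, & 9_{benign} > 9_{synchrony} \text{ and } 9_{synchrony} = 9_{correct} > 1,\\
2 \times 9_{correct}, & 9_{synchrony} > 2 \times 9_{benign} \text{ and } 9_{benign} = 9_{correct},\\
2 \times \min(9_{synchrony},\, 9_{correct}) - 1, & \text{otherwise.}
\end{cases}
\]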

Hence, for t = 1 we observe that the number of nines of consistency XPaxos adds on top of CFT
is proportional to the nines of probability for a correct or synchronous machine. The added nines are
not directly related to p_benign, although p_benign > p_correct must hold.


Example 1. When p_benign = 0.9999 and p_correct = p_synchrony = 0.999, we have p_non-crash = 0.0001 and
p_crash = 0.0009. In this example, 9 × p_non-crash = p_crash, i.e., if a machine suffers a fault 10 times,
then one of these is a non-crash fault and the rest are crash faults. In this case, 9ofC(CFT_{t=1}) =
9_benign − 1 = 3, whereas 9ofC(XPaxos_{t=1}) − 9ofC(CFT_{t=1}) = 9_correct − 1 = 2, i.e., 9ofC(XPaxos_{t=1}) = 5.
XPaxos adds 2 nines of consistency on top of CFT and achieves 5 nines of consistency in total.

Example 2. In a slightly different example, let p_benign = p_synchrony = 0.9999 and p_correct = 0.999, i.e.,
the network behaves more reliably than in Example 1. Then 9ofC(CFT_{t=1}) = 9_benign − 1 = 3, whereas
9ofC(XPaxos_{t=1}) − 9ofC(CFT_{t=1}) = 9_correct = 3, i.e., 9ofC(XPaxos_{t=1}) = 6. XPaxos adds 3 nines of
consistency on top of CFT and achieves 6 nines of consistency in total.
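
Examples 1 and 2 can be reproduced numerically. The sketch below evaluates the expression for P[XPaxos is consistent] from Section 6.1.1 (first term: no non-crash fault; sum: i non-crash, j crash and k partitioned replicas with i + j + k ≤ t) and converts it to nines. The function names are hypothetical and exact rational arithmetic is an implementation choice made here; only n = 2t + 1 is assumed.

```python
from fractions import Fraction
from math import comb


def nines(p):
    """9of(p) = floor(-log10(1 - p)), computed exactly on a Fraction."""
    q = 1 - Fraction(p)
    count = 0
    while 0 < q and q * 10 <= 1:
        q *= 10
        count += 1
    return count


def p_xpaxos_consistent(p_benign, p_correct, p_synchrony, t):
    """P[XPaxos is consistent] for n = 2t + 1 replicas."""
    p_benign, p_correct, p_synchrony = map(Fraction, (p_benign, p_correct, p_synchrony))
    n = 2 * t + 1
    p_non_crash = 1 - p_benign
    p_crash = p_benign - p_correct
    total = p_benign ** n                      # no non-crash fault at all
    for i in range(1, t + 1):                  # i non-crash-faulty replicas
        for j in range(0, t - i + 1):          # j crash-faulty replicas
            m = n - i - j                      # remaining correct replicas
            sync = sum(comb(m, k) * p_synchrony ** (m - k) * (1 - p_synchrony) ** k
                       for k in range(0, t - i - j + 1))  # k partitioned replicas
            total += (comb(n, i) * p_non_crash ** i
                      * comb(n - i, j) * p_crash ** j * p_correct ** m * sync)
    return total


print(nines(p_xpaxos_consistent("0.9999", "0.999", "0.999", 1)))   # Example 1: 5
print(nines(p_xpaxos_consistent("0.9999", "0.999", "0.9999", 1)))  # Example 2: 6
```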

6.1.2 XPaxos vs. BFT

Recall that (see Table 1) SMR in the asynchronous BFT model is consistent whenever no more than
one-third of the machines are non-crash faulty. Hence,

\[
P[\text{BFT is consistent}] = \sum_{i=0}^{\lfloor\frac{n-1}{3}\rfloor} \binom{n}{i}\, (1 - p_{benign})^{i} \times p_{benign}^{\,n-i}.
\]


We first examine the conditions under which XPaxos has stronger consistency guarantees than
BFT. Fixing the value t of tolerated faults, we observe that P[XPaxos is consistent] > P[BFT is consistent]
is equivalent to:

\[
p_{benign}^{\,2t+1}
  + \sum_{i=1}^{t} \binom{2t+1}{i}\, p_{non\text{-}crash}^{\,i}
  \times \sum_{j=0}^{t-i} \binom{2t+1-i}{j}\, p_{crash}^{\,j} \times p_{correct}^{\,2t+1-i-j}
  \times \sum_{k=0}^{t-i-j} \binom{2t+1-i-j}{k}\, p_{synchrony}^{\,2t+1-i-j-k} \times (1 - p_{synchrony})^{k}
  > \sum_{i=0}^{t} \binom{3t+1}{i}\, p_{benign}^{\,3t+1-i}\, (1 - p_{benign})^{i}.
\]

In the special case when t = 1, the above inequality simplifies to

\[
p_{available} > p_{benign}^{\,1.5}.
\]


Hence, for t = 1, XPaxos has stronger consistency guarantees than any asynchronous BFT protocol
whenever the probability that a machine is available is larger than the probability that a machine is
benign raised to the power 1.5. This is despite the fact that BFT is more expensive than XPaxos, as
t = 1 implies 4 replicas for BFT and only 3 for XPaxos.
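
The t = 1 simplification can be verified by hand. The following derivation is a sketch that assumes n = 3 for XPaxos and n = 4 for BFT, and uses p_non-crash = 1 − p_benign and p_available = p_correct × p_synchrony as defined at the beginning of Section 6:

```latex
\begin{align*}
P[\text{XPaxos is consistent}]
  &= p_{benign}^{3} + 3\,(1 - p_{benign})\, p_{correct}^{2}\, p_{synchrony}^{2}
   = p_{benign}^{3} + 3\,(1 - p_{benign})\, p_{available}^{2}, \\
P[\text{BFT is consistent}]
  &= p_{benign}^{4} + 4\,(1 - p_{benign})\, p_{benign}^{3}, \\
P[\text{XPaxos is consistent}] - P[\text{BFT is consistent}]
  &= p_{benign}^{3}(1 - p_{benign}) + 3\,(1 - p_{benign})\, p_{available}^{2}
     - 4\,(1 - p_{benign})\, p_{benign}^{3} \\
  &= 3\,(1 - p_{benign})\,\bigl(p_{available}^{2} - p_{benign}^{3}\bigr).
\end{align*}
```

For p_benign < 1 the difference is positive exactly when p_available^2 > p_benign^3, i.e., when p_available > p_benign^{1.5}.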

In terms of nines of consistency, again for t = 1 and t = 2, we calculated the difference in
consistency between XPaxos and BFT SMR, for all values of 9_benign, 9_correct and 9_synchrony ranging
between 1 and 20, and observed the following relation:

\[
9ofC(\mathrm{BFT}_{t=1}) - 9ofC(\mathrm{XPaxos}_{t=1}) =
\begin{cases}
9_{benign} - 9_{correct} + 1, & 9_{benign} > 9_{synchrony} \text{ and } 9_{synchrony} = 9_{correct},\\
9_{benign} - \min(9_{correct},\, 9_{synchrony}), & \text{otherwise.}
\end{cases}
\]

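\[
9ofC(\mathrm{BFT}_{t=2}) - 9ofC(\mathrm{XPaxos}_{t=2}) =
\begin{cases}
2 \times (9_{benign} - 9_{correct}) + 1, & 9_{benign} > 9_{synchrony} \text{ and } 9_{synchrony} = 9_{correct},\\
-1, & 9_{synchrony} > 2 \times 9_{benign} \text{ and } 9_{benign} = 9_{correct},\\
2 \times (9_{benign} - \min(9_{correct},\, 9_{synchrony})), & \text{otherwise.}
\end{cases}
\]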


Note that in cases where XPaxos guarantees better consistency than BFT (p_available > p_benign^{1.5}), it
is only “slightly” better and does not materialize in additional nines.

Example 1 (cont’d.). Building upon our example, with p_benign = 0.9999 and p_synchrony = p_correct = 0.999,
we have 9ofC(BFT_{t=1}) − 9ofC(XPaxos_{t=1}) = 9_benign − 9_synchrony + 1 = 2, i.e., 9ofC(XPaxos_{t=1}) = 5
and 9ofC(BFT_{t=1}) = 7. BFT brings 2 nines of consistency on top of XPaxos.

Example 2 (cont’d.). When p_benign = p_synchrony = 0.9999 and p_correct = 0.999, we have
9ofC(BFT_{t=1}) − 9ofC(XPaxos_{t=1}) = 1, i.e., 9ofC(XPaxos_{t=1}) = 6 and 9ofC(BFT_{t=1}) = 7. XPaxos has
one nine of consistency less than BFT (albeit only the 7th).

6.2 Availability 

We now quantify the stronger availability guarantees of XPaxos over asynchronous CFT and BFT
protocols. We define the number of nines of availability for protocol X as 9ofA(X) = 9of(P[X is available]).

Recall that whenever ⌊n/2⌋ + 1 active replicas in the synchronous group are available, XPaxos can
make progress regardless of whether passive replicas are benign or not, partitioned or not (see Table 1).
Thus, we have

\[
P[\text{XPaxos is available}] = \sum_{i=\lfloor\frac{n}{2}\rfloor+1}^{n} \binom{n}{i}\, p_{available}^{\,i} \times (1 - p_{available})^{n-i}.
\]

6.2.1 XPaxos vs. CFT 


A CFT protocol (e.g., Paxos) is available whenever n − ⌊(n−1)/2⌋ machines are correct and synchronous,
and the other machines are benign (see Table 1). Hence,

\[
P[\text{CFT is available}] = \sum_{i=n-\lfloor\frac{n-1}{2}\rfloor}^{n} \binom{n}{i}\, p_{available}^{\,i} \times (p_{benign} - p_{available})^{n-i}.
\]

Similarly to the consistency analysis, we calculated 9ofA(CFT) and 9ofA(XPaxos) for all values of
9_available and 9_benign between 1 and 20 in the cases where t = 1 and t = 2. Notice that p_available ≤ p_benign
is always true, i.e., 9_available ≤ 9_benign. We observed the following relation for t = 1:

\[
9ofA(\mathrm{XPaxos}_{t=1}) - 9ofA(\mathrm{CFT}_{t=1}) = \max(2 \times 9_{available} - 9_{benign},\, 0).
\]

When t = 2, we observed:

\[
9ofA(\mathrm{XPaxos}_{t=2}) = 3 \times 9_{available} - 1,
\]

\[
9ofA(\mathrm{XPaxos}_{t=2}) - 9ofA(\mathrm{CFT}_{t=2}) =
\begin{cases}
3 \times 9_{available} - 9_{benign}, & 9_{benign} < 3 \times 9_{available},\\
1, & 3 \times 9_{available} \le 9_{benign} < 4 \times 9_{available},\\
0, & 9_{benign} \ge 4 \times 9_{available}.
\end{cases}
\]

Example. When $p_{available} = 0.999$ and $p_{benign} = 0.99999$, we have $9ofA(XPaxos_{t=1}) - 9ofA(CFT_{t=1}) = 1$, i.e., $9ofA(XPaxos_{t=1}) = 5$ and $9ofA(CFT_{t=1}) = 4$. XPaxos adds 1 nine of availability on top of CFT and achieves 5 nines of availability in total. Besides, XPaxos adds 2 nines of availability on top of individual machine availability.
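The numbers in this example can be reproduced with a few lines of code. The following is a minimal sketch, assuming that 9of(p) denotes ⌊−log10(1 − p)⌋ and that both protocols run on n = 2t + 1 machines; the function names are illustrative only, and exact rational arithmetic is used so that rounding does not shift a value across a power of ten.

```python
from fractions import Fraction
from math import comb

def nines(p):
    """floor(-log10(1 - p)) for an exact probability p < 1 (assumed definition of 9of)."""
    q = 1 - p
    if q <= 0:
        raise ValueError("p must be strictly below 1")
    k = 0
    while q * 10 ** (k + 1) <= 1:
        k += 1
    return k

def p_xpaxos_available(n, p_avail):
    # At least floor((n-1)/2) + 1 machines are available (correct and synchronous).
    lo = (n - 1) // 2 + 1
    return sum(comb(n, i) * p_avail**i * (1 - p_avail)**(n - i)
               for i in range(lo, n + 1))

def p_cft_available(n, p_avail, p_benign):
    # At least n - floor((n-1)/2) machines are available and all others are benign.
    lo = n - (n - 1) // 2
    return sum(comb(n, i) * p_avail**i * (p_benign - p_avail)**(n - i)
               for i in range(lo, n + 1))

p_avail = Fraction(999, 1000)        # 9_available = 3
p_benign = Fraction(99999, 100000)   # 9_benign = 5
n = 3                                # t = 1

xp = nines(p_xpaxos_available(n, p_avail))
cft = nines(p_cft_available(n, p_avail, p_benign))
print(xp, cft)  # 5 4
```

With these inputs the difference is 1, which agrees with max(2 × 3 − 5, 0).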

6.2.2 XPaxos vs. BFT 

From Table 1, an asynchronous BFT protocol is available when $n - \lfloor \frac{n-1}{3} \rfloor$ machines are available despite faults of the other machines. Thus,

$$P[\text{BFT is available}] = \sum_{i=n-\lfloor \frac{n-1}{3} \rfloor}^{n} \binom{n}{i} \, p_{available}^{i} \times (1 - p_{available})^{n-i}.$$

We calculated $9ofA(XPaxos)$ and $9ofA(BFT)$ for all values of $9_{available}$ between 1 and 20 in the cases when t = 1 and t = 2. In this comparison $9_{benign}$ does not matter. When t = 1,

$$9ofA(XPaxos_{t=1}) = 9ofA(BFT_{t=1}) = 2 \times 9_{available} - 1.$$

On the other hand, when t = 2,

$$9ofA(XPaxos_{t=2}) = 9ofA(BFT_{t=2}) + 1 = 3 \times 9_{available} - 1.$$

Hence, when t = 1, XPaxos has the same number of nines of availability as BFT. When t = 2, XPaxos adds 1 nine of availability to BFT.
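These two relations can be checked numerically. The sketch below assumes 9of(p) = ⌊−log10(1 − p)⌋, n = 2t + 1 replicas for XPaxos, n = 3t + 1 replicas for asynchronous BFT, and $p_{available} = 1 - 10^{-9_{available}}$; the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def nines(p):
    """floor(-log10(1 - p)) for an exact probability p < 1 (assumed definition of 9of)."""
    q = 1 - p
    if q <= 0:
        raise ValueError("p must be strictly below 1")
    k = 0
    while q * 10 ** (k + 1) <= 1:
        k += 1
    return k

def p_at_least(n, need, p):
    """Probability that at least `need` of n independent machines are available."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(need, n + 1))

def nines_xpaxos(t, nines_available):
    n = 2 * t + 1                                  # XPaxos replica count
    p = 1 - Fraction(1, 10**nines_available)
    return nines(p_at_least(n, (n - 1) // 2 + 1, p))

def nines_bft(t, nines_available):
    n = 3 * t + 1                                  # assumed BFT replica count
    p = 1 - Fraction(1, 10**nines_available)
    return nines(p_at_least(n, n - (n - 1) // 3, p))

# Sweep the same range of 9_available as in the text.
for k in range(1, 21):
    assert nines_xpaxos(1, k) == nines_bft(1, k) == 2 * k - 1
    assert nines_xpaxos(2, k) == nines_bft(2, k) + 1 == 3 * k - 1
print("closed forms hold for 9_available in 1..20")
```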

7 Related work and concluding remarks 

In this paper, we introduced XFT, a novel fault-tolerance model that allows the design of efficient protocols that tolerate non-crash faults. We demonstrated XFT through XPaxos, a novel state-machine replication protocol that features many more nines of reliability than the best crash-fault-tolerant (CFT) protocols with roughly the same communication complexity, performance and resource cost. Namely, XPaxos uses 2t + 1 replicas and provides all the reliability guarantees of CFT, but is also capable of tolerating non-crash faults, as long as a majority of XPaxos replicas are correct and can communicate synchronously among each other.

As XFT is entirely realized in software, it is fundamentally different from an established approach that relies on trusted hardware for reducing the resource cost of BFT to 2t + 1 replicas only [15, 30, 21, 38].

XPaxos is also different from PASC [14], which makes CFT protocols tolerate a subset of Byzantine 
faults using ASC-hardening. ASC-hardening modifies an application by keeping two copies of the state 
at each replica. It then tolerates Byzantine faults under the “fault diversity” assumption, i.e., that 
a fault will not corrupt both copies of the state in the same way. Unlike XPaxos, PASC does not 
tolerate Byzantine faults that affect the entire replica (e.g., both state copies). 

In this paper, we did not explore the impact of varying the number of tolerated faults per fault class. In short, this approach, known as the hybrid fault model and introduced in [37], distinguishes the threshold of non-crash faults (say b) despite which safety should be ensured, from the threshold t of faults (of any class) despite which availability should be ensured (where often b < t). The hybrid fault model and its refinements [11, 34] appear orthogonal to our XFT approach.

Specifically, Visigoth Fault Tolerance (VFT) [34] is a recent refinement of the hybrid fault model. Besides having different thresholds for non-crash and crash faults, VFT also refines the space between network synchrony and asynchrony by defining the threshold of network faults that a VFT protocol can tolerate. VFT is, however, different from XFT in that it fixes separate fault thresholds for non-crash and network faults. This difference is fundamental rather than notational, as XFT cannot be expressed by choosing specific values of VFT thresholds. For instance, XPaxos can tolerate, with 2t + 1 replicas, t partitioned replicas, t non-crash faults and t crash faults, albeit not simultaneously. Specifying such requirements in VFT would yield at least 3t + 1 replicas. In addition, VFT protocols have more complex communication patterns than XPaxos. That said, many of the VFT concepts remain orthogonal to XFT. It would be interesting to explore interactions between the hybrid fault model (including its refinements such as VFT) and XFT in the future.

Going beyond the research directions outlined above, this paper also opens other avenues for future work. For instance, many important distributed computing problems that build on SMR, such as distributed storage and blockchain, deserve a fresh look through the XFT prism.
Appendix A XPaxos example execution

[Diagram with two panels, (a) without FD and (b) with FD. Each panel shows the PREPARE and COMMIT steps for requests r0 to r3 in view i, view i+1 and view i+2, the view changes (VC) between consecutive views, the network fault and the non-crash fault (detected in panel (b)).]


Figure 11: XPaxos example. The view is changed from i to i + 2, due to the network fault of s1 and the non-crash fault of s0, respectively.


In Fig. 11 we give an example of XPaxos execution when t = 1. The role of each replica in each view is shown in Table 2.

In Fig. 11a, the view change proceeds without fault detection. Upon receiving requests r0, r1, and r2 from clients, the primary s0 prepares these requests locally and sends commit messages to the follower s1. Then, s1 commits r0, r1, and r2 locally and sends commit messages to s0. Because of a network fault, s0 receives only the commit message of r0 in a timely manner, so s0 activates the view change to i + 1. During the view change to i + 1, s0 sends the view-change message with the commit log of r0 to all active replicas in view i + 1 (i.e., s0 and s2). In view i + 1, r3 is further committed by s0 and s2. After that, s0 suffers a non-crash fault and the view is changed to i + 2. During the view change to i + 2, s1 and s2 provide all their commit logs to the new active replicas (i.e., s1 and s2), whereas the non-crash-faulty replica s0 reports only the commit log of r0. Outside anarchy, requests r0 and r3 are committed in the new view i + 2 by receiving the view-change message from s2. Request r3 is also committed by receiving the view-change message from s1. In view i + 2, r1 is finally committed by every active replica.

In the example of Figure 11b, XPaxos fault detection is enabled. In view i, the execution is the same as in Figure 11a. During the view change to i + 1, the commit log of r0 and the prepare logs of r1 and r2 are sent by s0; these requests are committed by s0 and s2 in view i + 1, together with the new request r3. As before, s0 is non-crash faulty and the view is changed to i + 2. During the view change to i + 2, the commit logs of r0, r1, r2 and r3 are sent by s2. At the same time, because the prepare log of r3 is missing, the fault of s0 is detected with the help of the view-change message from s2.
Appendix B XPaxos pseudocode


In this appendix we give the pseudocode of XPaxos. For simplicity, we assume that the signature/MAC attached to each message always verifies correctly. Figure 12 gives the definition of message fields and local variables for all components of XPaxos. Readers can refer to Section 4 for the protocol description.

This appendix is organized incrementally as follows. Section B.1 gives the pseudocode of the XPaxos common case. Section B.2 gives the pseudocode of the view change mechanism. Section B.3 describes and gives the pseudocode of the clients' request retransmission mechanism that deals with a faulty primary. Finally, Section B.4 describes the modification to the view change protocol that enables Fault Detection and gives the pseudocode.


Common case :

c, op, ts_c - id of the client, operation, client timestamp
req_c - ongoing request at client c
n - total number of replicas
Π - set of n replicas
i - current view number
s_j - replica id
sg_i - set of t + 1 replicas in synchronous group in view i
ps_i - the primary in view i (ps_i ∈ sg_i)
fs_i - the follower in view i for t = 1 (fs_i ∈ sg_i)
fs_i^k - the followers in view i for t ≥ 2 (fs_i^k ∈ sg_i)
req - client request
rep - reply of client request
sn_{s_j} - sequence number prepared at replica s_j
ex_{s_j} - sequence number executed at replica s_j
D(m) - digest of a message m
PrepareLog_{s_j} - array of prepare proofs at replica s_j
CommitLog_{s_j} - array of commit proofs at replica s_j

View change :

SusSet_{s_j} - set of SUSPECT messages cached for view-change at replica s_j
timer^{net}_i - network establishment timer for view i
Δ - maximum message delay between two correct replicas, beyond which a network fault is declared
timer^{vc}_i - view-change timer in view change to i
VCSet^i_{s_j} - set of VIEW-CHANGE messages collected in view change to i at replica s_j
CommitLog^i_{s_j} - array of most recent commit proofs selected from VCSet^i_{s_j} at replica s_j
End(log) - end index of array log

Fault detection :

FinalProof_{s_j} - array of t + 1 VC-CONFIRM messages which prove that ∀s_k ∈ sg_i collected the same VCSet^i_{s_k}
pre_{s_j} - the view number in which PrepareLog_{s_j} is generated
FinalSet^i_{s_j} - set of t + 1 VC-FINAL messages collected in view change to i at replica s_j
PrepareLog^i_{s_j} - array of most recent prepare proofs selected from VCSet^i_{s_j} at replica s_j

Figure 12: XPaxos common case: Message fields and local variables.
B.1 Common case


In the common case, we assume that all replicas are in the same view. Algorithm 1 and Algorithm 2 describe the common case protocol when t = 1 and t ≥ 2, respectively. Figure 2 gives the message pattern.


Algorithm 1 Common case when t = 1.

Initialization:
    client : ts_c ← 0; req_c ← nil
    replica : sn_{s_j} ← 0; ex_{s_j} ← 0; PrepareLog_{s_j} = []; CommitLog_{s_j} = []

1: upon invocation of propose(op) at client c do
2:     inc(ts_c)
3:     send req_c ← ⟨REPLICATE, op, ts_c, c⟩_{σ_c} to the primary ps_i ∈ sg_i
4:     start timer_c

5: upon reception of req = ⟨REPLICATE, op, ts, c⟩_{σ_c} from client c at ps_i do /* primary */
6:     inc(sn_{ps_i})
7:     m_{ps_i} ← ⟨COMMIT, D(req), sn_{ps_i}, i⟩_{σ_{ps_i}}
8:     PrepareLog_{ps_i}[sn_{ps_i}] ← ⟨req, m_{ps_i}⟩
9:     send ⟨req, m_{ps_i}⟩ to the follower fs_i

10: upon reception of ⟨req, m_{ps_i} = ⟨COMMIT, d_req, sn, i⟩_{σ_{ps_i}}⟩ from the primary at fs_i do /* follower */
11:     if sn = sn_{fs_i} + 1 and D(req) = d_req then
12:         inc(sn_{fs_i})
13:         rep ← execute req
14:         inc(ex_{fs_i})
15:         m_{fs_i} ← ⟨COMMIT, D(req), sn, i, req.ts_c, D(rep)⟩_{σ_{fs_i}}
16:         CommitLog_{fs_i}[sn] ← ⟨req, m_{ps_i}, m_{fs_i}⟩
17:         send m_{fs_i} to the primary ps_i

18: upon reception of m_{fs_i} = ⟨COMMIT, d_req, sn, i, ts, d_rep⟩_{σ_{fs_i}} from the follower fs_i at ps_i do
19:     if D(PrepareLog_{ps_i}[sn].req) = d_req then
20:         CommitLog_{ps_i}[sn] ← ⟨req, m_{ps_i}, m_{fs_i}⟩

21: upon CommitLog_{ps_i}[ex_{ps_i} + 1] ≠ nil at ps_i do
22:     inc(ex_{ps_i})
23:     rep ← execute CommitLog_{ps_i}[ex_{ps_i}].req
24:     if D(rep) = CommitLog_{ps_i}[ex_{ps_i}].m_{fs_i}.d_rep then
25:         send ⟨⟨REPLY, sn, i, ts, rep⟩_{σ_{ps_i}}, m_{fs_i}⟩ to CommitLog_{ps_i}[ex_{ps_i}].req.c

26: upon reception of ⟨r_{ps_i}, m_{fs_i}⟩ from the primary ps_i at client c, where
        r_{ps_i} = ⟨REPLY, sn, i, ts, rep⟩_{σ_{ps_i}} and
        m_{fs_i} = ⟨COMMIT, d_req, sn', i', ts', d_rep⟩_{σ_{fs_i}} do
27:     if sn = sn' and i = i' and ts = ts' = req.ts_c and D(rep) = d_rep then
28:         deliver rep
29:         stop timer_c
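The client-side check on line 27 of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: messages are modelled as plain dictionaries, SHA-256 stands in for the digest function D (the appendix does not fix a hash function), and signatures are assumed to verify, as stated at the start of this appendix. The names `digest` and `client_accepts` are hypothetical.

```python
import hashlib

def digest(data: bytes) -> str:
    # Stand-in for D(.); SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(data).hexdigest()

def client_accepts(reply: dict, commit: dict, req_ts: int) -> bool:
    """Mirror of the line-27 check: the primary's REPLY must agree with the
    follower's COMMIT on sn, view and timestamp, and the follower must have
    committed to the digest of the same reply."""
    return (
        reply["sn"] == commit["sn"]
        and reply["view"] == commit["view"]
        and reply["ts"] == commit["ts"] == req_ts
        and digest(reply["rep"]) == commit["d_rep"]
    )

reply = {"sn": 1, "view": 0, "ts": 7, "rep": b"ok"}
commit = {"sn": 1, "view": 0, "ts": 7, "d_rep": digest(b"ok")}

print(client_accepts(reply, commit, req_ts=7))                   # True
print(client_accepts({**reply, "rep": b"forged"}, commit, 7))    # False
```

The second call shows why the follower's digest matters when t = 1: a primary that returns a reply different from the one the follower executed fails the check.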
Algorithm 2 Common case when t > 1.

Initialization:
    client : ts_c ← 0; req_c ← nil
    replica : sn_{s_j} ← 0; ex_{s_j} ← 0; PrepareLog_{s_j} = []; CommitLog_{s_j} = []

1: upon invocation of propose(op) at client c do
2:     inc(ts_c)
3:     send req_c ← ⟨REPLICATE, op, ts_c, c⟩_{σ_c} to the primary ps_i ∈ sg_i
4:     start timer_c

5: upon reception of req = ⟨REPLICATE, op, ts, c⟩_{σ_c} from client c at ps_i do /* primary */
6:     inc(sn_{ps_i})
7:     m_{ps_i} ← ⟨PREPARE, D(req), sn_{ps_i}, i⟩_{σ_{ps_i}}
8:     PrepareLog_{ps_i}[sn] ← ⟨req, m_{ps_i}⟩
9:     send ⟨req, m_{ps_i}⟩ to ∀fs_i^k ∈ sg_i

10: upon reception of ⟨req, m_{ps_i} = ⟨PREPARE, d_req, sn, i⟩_{σ_{ps_i}}⟩ from the primary at fs_i^k do /* follower */
11:     if sn = sn_{fs_i^k} + 1 and D(req) = d_req then
12:         inc(sn_{fs_i^k})
13:         PrepareLog_{fs_i^k}[sn] ← ⟨req, m_{ps_i}⟩
14:         m_{fs_i^k} ← ⟨COMMIT, D(req), sn, i, fs_i^k⟩_{σ_{fs_i^k}}
15:         send m_{fs_i^k} to ∀s_k ∈ sg_i

16: upon reception of m_{fs_i^k} = ⟨COMMIT, d_req, sn, i, fs_i^k⟩_{σ_{fs_i^k}} from every follower fs_i^k ∈ sg_i at s_j ∈ sg_i do
17:     CommitLog_{s_j}[sn] ← ⟨req, m_{ps_i}, m_{fs_i^1} ... m_{fs_i^t}⟩

18: upon CommitLog_{s_j}[ex_{s_j} + 1] ≠ nil at s_j do
19:     inc(ex_{s_j})
20:     rep ← execute CommitLog_{s_j}[ex_{s_j}].req
21:     send ⟨REPLY, sn, i, req.ts_c, rep⟩_{σ_{s_j}} to client c, where c = CommitLog_{s_j}[ex_{s_j}].req.c

22: upon reception of t + 1 REPLY messages ⟨REPLY, sn, i, ts, rep⟩_{σ_{s_j}} at client c do
23:     if t + 1 REPLY messages are with the same sn, i, ts and rep and ts = req.ts_c then
24:         deliver rep
25:         stop timer_c
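Lines 22-25 of Algorithm 2 amount to counting matching replies from distinct replicas. A minimal sketch of that rule is given below; the `Reply` record and the function `accept_reply` are hypothetical names, and signature verification is omitted because this appendix assumes signatures always verify.

```python
from typing import Iterable, NamedTuple, Optional

class Reply(NamedTuple):
    replica: str   # sender id (hypothetical field, used to count distinct replicas)
    sn: int        # sequence number
    view: int      # view number i
    ts: int        # client timestamp echoed by the replica
    rep: str       # reply payload

def accept_reply(replies: Iterable[Reply], t: int, req_ts: int) -> Optional[str]:
    """Return rep once t + 1 distinct replicas sent REPLY messages with the same
    (sn, view, ts, rep) and ts equals the timestamp of the pending request."""
    senders_by_key = {}
    for r in replies:
        if r.ts != req_ts:
            continue  # reply to a different request
        key = (r.sn, r.view, r.ts, r.rep)
        senders_by_key.setdefault(key, set()).add(r.replica)
    for key, senders in senders_by_key.items():
        if len(senders) >= t + 1:
            return key[3]
    return None

good = [Reply("s0", 4, 1, 9, "ok"), Reply("s1", 4, 1, 9, "ok"), Reply("s2", 4, 1, 9, "ok")]
mixed = [Reply("s0", 4, 1, 9, "ok"), Reply("s1", 4, 1, 9, "ok"), Reply("s2", 4, 1, 9, "bad")]

print(accept_reply(good, t=2, req_ts=9))    # ok
print(accept_reply(mixed, t=2, req_ts=9))   # None
```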
B.2 View-change

The message pattern of view change without fault detection is given in Figure 3. Algorithm 3 shows the corresponding pseudocode. The description of view change can be found in Section 4.3.


Algorithm 3 View change at replica s_j.

Initialization:
    SusSet_{s_j} ← ∅; VCSet^i_{s_j} ← ∅; CommitLog^i_{s_j} ← []

1: upon suspicion of view i and s_j ∈ sg_i do
2:     send ⟨SUSPECT, i, s_j⟩_{σ_{s_j}} to ∀s_k ∈ Π

3: upon reception of m = ⟨SUSPECT, i', s_k⟩_{σ_{s_k}} and s_k ∈ sg_{i'} do
4:     SusSet_{s_j} ← SusSet_{s_j} ∪ {m}
5:     forward m to ∀s_k ∈ Π

6: upon ∃⟨SUSPECT, i, s_k⟩_{σ_{s_k}} ∈ SusSet_{s_j} do /* enter each view in order */
7:     inc(i) (i.e., ignore any message in preceding view)
8:     send ⟨VIEW-CHANGE, i, s_j, CommitLog_{s_j}⟩_{σ_{s_j}} to ∀s_k ∈ sg_i
9:     if s_j ∈ sg_i then
10:         start timer^{net}_i ← 2Δ

11: upon reception of m = ⟨VIEW-CHANGE, i, s_k, CommitLog⟩_{σ_{s_k}} from replica s_k do
12:     VCSet^i_{s_j} ← VCSet^i_{s_j} ∪ {m}

13: upon |VCSet^i_{s_j}| = n or (expiration of timer^{net}_i and |VCSet^i_{s_j}| ≥ n − t) do
14:     send ⟨VC-FINAL, i, s_j, VCSet^i_{s_j}⟩_{σ_{s_j}} to ∀s_k ∈ sg_i
15:     start timer^{vc}_i

16: upon reception of m_k = ⟨VC-FINAL, i, s_k, VCSet⟩_{σ_{s_k}} from every s_k ∈ sg_i do
17:     VCSet^i_{s_j} ← VCSet^i_{s_j} ∪ {∀m : m ∈ VCSet in any m_k}
18:     for sn : 1..End(∀CommitLog | ∃m ∈ VCSet^i_{s_j} : CommitLog is in m) do
19:         CommitLog^i_{s_j}[sn] ← CommitLog[sn] with the highest view number
20:     if s_j = ps_i then /* primary */
21:         for sn : 1..End(CommitLog^i_{s_j}) do
22:             req ← CommitLog^i_{s_j}[sn].req
23:             PrepareLog[sn] ← ⟨req, ⟨PREPARE, D(req), sn, i⟩_{σ_{s_j}}⟩
24:         send ⟨NEW-VIEW, i, PrepareLog⟩_{σ_{s_j}} to ∀s_k ∈ sg_i

25: upon reception of ⟨NEW-VIEW, i, PrepareLog⟩_{σ_{ps_i}} from the primary ps_i do
26:     if PrepareLog is matching with CommitLog^i_{s_j} then
27:         PrepareLog_{s_j} ← PrepareLog
28:         reply and process ∀m ∈ PrepareLog as in common case
29:         sn_{s_j} ← End(PrepareLog)
30:         ex_{s_j} ← End(PrepareLog)
31:         stop timer^{vc}_i
32:     else
33:         suspect view i

34: upon expiration of timer^{vc}_i do
35:     suspect view i
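The selection rule on lines 18-19 of Algorithm 3 (for each sequence number, keep the commit-log entry generated in the highest view) can be sketched as follows. The representation is an assumption made for illustration: each commit log is a dictionary from sequence number to a (view, request) pair, and the function name `select_commit_log` is hypothetical. The example data is generic and is not meant to reproduce Figure 11.

```python
def select_commit_log(commit_logs):
    """Merge the commit logs found in the collected VIEW-CHANGE messages.

    commit_logs: iterable of dicts mapping sn -> (view, req).
    Returns a dict mapping sn -> (view, req), keeping for every sequence
    number the entry that was generated in the highest view.
    """
    selected = {}
    for log in commit_logs:
        for sn, (view, req) in log.items():
            if sn not in selected or view > selected[sn][0]:
                selected[sn] = (view, req)
    return selected

# Two hypothetical logs: the second replica committed sn 2 again in a later view.
log_a = {1: (0, "a"), 2: (0, "b"), 3: (0, "c")}
log_b = {1: (0, "a"), 2: (1, "d")}

merged = select_commit_log([log_a, log_b])
print(merged)  # {1: (0, 'a'), 2: (1, 'd'), 3: (0, 'c')}
```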
B.3 Request retransmission

In order to provide availability in the presence of a faulty primary or faulty followers, as well as long-lived network faults within the synchronous group, we propose a request retransmission mechanism which broadcasts the request to all active replicas when the retransmission timer expires at the client. The retransmission mechanism requires every active replica to monitor progress. If a request is not executed and replied to in a timely manner, a correct active replica in the synchronous group will eventually suspect the view.

More specifically (the pseudocode is given in Algorithm 4), if a client c does not receive matching replies to request req_c in a timely manner, c re-sends req_c to all active replicas in the current view i in a ⟨RE-SEND, req_c⟩ message. Any active replica s_j ∈ sg_i, upon receiving ⟨RE-SEND, req_c⟩ from c, (1) forwards req_c to the primary ps_i ∈ sg_i if s_j ≠ ps_i, (2) starts a local timer timer_{req_c}, and (3) asks each active replica to sign the reply. If timer_{req_c} expires and the active replica s_j ∈ sg_i has not received t + 1 signed replies, s_j suspects view i and sends the SUSPECT message to the client c; otherwise, s_j forwards t + 1 signed replies to client c.

Upon receiving a SUSPECT message m for view i, client c forwards m to every active replica in view i + 1. This step serves to guarantee that the view change can actually happen at all correct replicas. Then client c forwards req_c to the primary of view i + 1.


Algorithm 4 Client request retransmission.

1: upon expiration of timer_c at client c do
2:     send ⟨RE-SEND, req_c⟩ to ∀s_j ∈ sg_i

3: upon reception of ⟨RE-SEND, req_c⟩ at s_j ∈ sg_i do
4:     if s_j ≠ ps_i then
5:         send req_c to ps_i ∈ sg_i
6:     start timer_{req_c}
7:     ask ∀s_j ∈ sg_i to sign the reply of req_c

8: upon expiration of timer_{req_c} at replica s_j ∈ sg_i do
9:     suspect view i
10:     send ⟨SUSPECT, i, s_j⟩_{σ_{s_j}} to client c

11: upon reception of m = ⟨SUSPECT, i, s_k⟩_{σ_{s_k}} at client c and s_k ∈ sg_i and c is in view i do
12:     enter view i + 1
13:     send m to ∀s_j ∈ sg_{i+1}
14:     send req_c to ps_{i+1}
15:     start timer_c

16: upon execution of req_c at s_j (a) do
        /* sign the reply by each active replica */
17:     send ⟨REPLY, sn, i, req.ts_c, rep⟩_{σ_{s_j}} to ∀s_k ∈ sg_i

18: upon reception of m_k = ⟨REPLY, sn, i, ts, rep⟩_{σ_{s_k}} from every s_k ∈ sg_i at replica s_j do
        /* collect t + 1 signed replies */
19:     if m_1, m_2, ..., m_{t+1} are with the same sn, i, ts and rep then
20:         replies ← {m_1, m_2, ..., m_{t+1}}
21:         send ⟨SIGNED-REPLY, replies⟩ to client c
22:         stop timer_{req_c}

(a) by line 13 or 23 in Algorithm 1 or line 20 in Algorithm 2
[Diagram: message pattern with the phases SUSPECT, VIEW-CHANGE, VC-FINAL, VC-CONFIRM and NEW-VIEW; messages are sent with signatures.]


Figure 13: Message pattern of XPaxos view-change with fault detection: VC-CONFIRM phase is added; synchronous group is changed from (s0, s1) to (s0, s2).


B.4 Fault detection 

In this section we describe XPaxos with Fault Detection (FD). Specifically, in order to detect all the fatal faults that can possibly violate consistency in anarchy, view change to i + 1 with FD includes the following modifications.

• Every replica s_j appends its prepare logs PrepareLog_{s_j} to the VIEW-CHANGE message when replying to active replicas in view i + 1. Besides, synchronous group sg_{i+1} prepares and commits requests piggybacked in commit or prepare logs. The selection rule is almost the same as in view change without FD: for each sequence number sn, the request with the highest view number i' ≤ i is selected, either in a commit log or in a prepare log.

• XPaxos FD additionally inserts a VC-CONFIRM phase after exchanging VIEW-CHANGE messages among active replicas in view i + 1, i.e., after receiving t + 1 VC-FINAL messages (see Figure 3 and Figure 13 for the comparison). In the VC-CONFIRM phase, every active replica s_j ∈ sg_{i+1} (1) detects potential faults in the VIEW-CHANGE messages in VCSet^{i+1}_{s_j} and adds the faulty replica to set FSet; (2) removes faulty messages from VCSet^{i+1}_{s_j}; and (3) signs and sends ⟨VC-CONFIRM, i + 1, D(VCSet^{i+1}_{s_j})⟩_{σ_{s_j}} to every active replica in sg_{i+1}. Upon receiving t + 1 VC-CONFIRM messages with matching D(VCSet^{i+1}_{s_j}), s_j ∈ sg_{i+1} (1) inserts the VC-CONFIRM messages into set FinalProof_{s_j}[i + 1]; and (2) prepares and commits the requests selected based on VCSet^{i+1}_{s_j}. FinalProof_{s_j}[i + 1] serves to prove that t + 1 active replicas in sg_{i+1} have agreed on the set of filtered VIEW-CHANGE messages.

• Every replica s_j appends FinalProof_{s_j}[i'] to the VIEW-CHANGE message when replying to active replicas in the new view, where i' is the view in which PrepareLog_{s_j} is generated. In case a prepare log in PrepareLog_{s_j} is not consistent with some commit log, FinalProof_{s_j}[i'] can prove that there exists a correct replica s_k ∈ sg_{i'} which can prove the fault of the prepare log.

Algorithm 5 gives the modifications based on Algorithm 3 for XPaxos with the fault detection mechanism. Algorithm 6 enumerates all types of faults that can and must be detected by correct active replicas. Figure 13 gives the new message pattern.
Algorithm 5 Modifications for fault detection at replica s_j.

Initialization:
    FinalProof_{s_j} ← []; pre_{s_j} ← 0; FinalSet^i_{s_j} ← ∅; PrepareLog^i_{s_j} ← []; FSet ← ∅

/* replace line 8 in Algorithm 3 by : */
1: send m = ⟨VIEW-CHANGE, i, s_j, CommitLog_{s_j}, PrepareLog_{s_j}, FinalProof_{s_j}[pre_{s_j}]⟩_{σ_{s_j}} to ∀s_k ∈ sg_i

/* replace line 11 in Algorithm 3 by : */
2: upon reception of m = ⟨VIEW-CHANGE, i, s_k, CommitLog, PrepareLog, FinalProof⟩_{σ_{s_k}} from replica s_k do

/* replace lines 18-24 in Algorithm 3 by : */
3: FAULTDETECTION(VCSet^i_{s_j}) /* refer to Algorithm 6 */
4: for ∀m : m ∈ VCSet^i_{s_j} and m from replica s ∈ FSet do
5:     remove m from VCSet^i_{s_j}
6: send ⟨VC-CONFIRM, i, D(VCSet^i_{s_j})⟩_{σ_{s_j}} to ∀s_k ∈ sg_i

/* new event handler */
7: upon reception of m_k = ⟨VC-CONFIRM, i, d_{VCSet}⟩_{σ_{s_k}} from every s_k ∈ sg_i do
8:     if m_1, m_2, ..., m_{t+1} are not with the same d_{VCSet} then
9:         suspect view i
10:         return
11:     FinalProof_{s_j}[i] ← {m_1, m_2, ..., m_{t+1}}
12:     for sn : 1..End(∀CommitLog | ∃m ∈ VCSet^i_{s_j} : CommitLog is in m) do
13:         CommitLog^i_{s_j}[sn] ← CommitLog[sn] with the highest view number
14:     for sn : 1..End(∀PrepareLog | ∃m ∈ VCSet^i_{s_j} : PrepareLog is in m) do
15:         PrepareLog^i_{s_j}[sn] ← PrepareLog[sn] with the highest view number
16:     if s_j = ps_i then /* primary */
17:         for sn : 1..End(PrepareLog^i_{s_j} | CommitLog^i_{s_j}) do
18:             req ← CommitLog^i_{s_j}[sn].req
19:             if req = null or PrepareLog^i_{s_j}[sn] is generated in a higher view than CommitLog^i_{s_j}[sn] then
20:                 req ← PrepareLog^i_{s_j}[sn].req
21:             PrepareLog[sn] ← ⟨req, ⟨PREPARE, D(req), sn, i⟩_{σ_{s_j}}⟩
22:         send ⟨NEW-VIEW, i, PrepareLog⟩_{σ_{s_j}} to ∀s_k ∈ sg_i

/* replace line 26 in Algorithm 3 by : */
23: if PrepareLog is matching with CommitLog^i_{s_j} and PrepareLog^i_{s_j} then

/* add this command after line 27 in Algorithm 3 : */
24: pre_{s_j} ← i /* update the view in which PrepareLog_{s_j} is generated */
Algorithm 6 Fault detection function at replica s_j.

1: function FAULTDETECTION(VCSet)
2:     ∀sn and m, m' ∈ VCSet from replicas s_k and s_{k'}, respectively,
3:     (state loss) if s_k, s_{k'} ∈ sg_{i'} (i' < i) and CommitLog'[sn] in m' is generated in view i' and PrepareLog is in m and PrepareLog[sn] = nil then (s_k is faulty)
4:         send ⟨STATE-LOSS, i, s_k, sn, m, m'⟩ to ∀s_{k''} ∈ Π
5:         add s_k to FSet
6:     (fork-I) if s_k, s_{k'} ∈ sg_{i'} (i' < i) and PrepareLog[sn] in m is generated in view i'' and CommitLog'[sn] in m' is generated in view i' and ((i'' = i' and PrepareLog[sn].req ≠ CommitLog'[sn].req) or i'' < i') then (s_k is faulty)
7:         send ⟨FORK-I, i, s_k, sn, m, m'⟩ to ∀s_{k''} ∈ Π
8:         add s_k to FSet
9:     (fork-II-query) if PrepareLog[sn] in m is generated in view i'' (i'' < i) and CommitLog'[sn] in m' is generated in view i' (i' < i'' < i) and (PrepareLog[sn] = null or PrepareLog[sn].req ≠ CommitLog'[sn].req) then (s_k might be faulty)
10:         send ⟨FORK-II-QUERY, i, s_k, sn, m⟩ to ∀s_{k''} ∈ sg_{i''}
11:         wait for 2Δ time

12: upon reception of ⟨FORK-II-QUERY, i, s_k, sn, m⟩ at s_j, where FinalProof in m is generated in view i'' and s_j ∈ sg_{i''} do
13:     if PrepareLog[sn] in m is not consistent with VCSet^{i''}_{s_j} then
14:         send ⟨FORK-II, i, s_k, sn, m, FinalProof_{s_j}[i''], FinalSet^{i''}_{s_j}⟩ to ∀s_k ∈ Π

15: upon reception of ⟨FORK-II, i, s_k, sn, m, FinalProof, FinalSet⟩ do
16:     add s_k to FSet

17: upon reception of STATE-LOSS, FORK-I or FORK-II message m do
18:     forward m to ∀s_k ∈ Π
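The state-loss and fork-I checks of Algorithm 6 can be sketched in a few lines. This is an illustration under several assumptions rather than the protocol's implementation: a VIEW-CHANGE message is reduced to a sender id plus two dictionaries mapping a sequence number to a (view, request) pair; `sg` maps a view number to the set of its active replicas; the second disjunct of the fork-I condition is read as i'' < i'; and the fork-II query round, signatures and message forwarding are left out. All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ViewChange:
    sender: str
    commit_log: dict = field(default_factory=dict)    # sn -> (view, req)
    prepare_log: dict = field(default_factory=dict)   # sn -> (view, req)

def detect_faults(vc_set, sg, i):
    """Return the replicas exposed by the state-loss or fork-I rule in view change to i."""
    fset = set()
    for m in vc_set:            # m carries the prepare log under scrutiny
        for m2 in vc_set:       # m2 carries the commit log used as evidence
            for sn, (v_commit, req_commit) in m2.commit_log.items():
                if v_commit >= i:
                    continue
                group = sg.get(v_commit, set())
                if m.sender not in group or m2.sender not in group:
                    continue    # both replicas must have been active in view i'
                entry = m.prepare_log.get(sn)
                if entry is None:
                    fset.add(m.sender)      # state loss: prepare log entry missing
                    continue
                v_prep, req_prep = entry
                if (v_prep == v_commit and req_prep != req_commit) or v_prep < v_commit:
                    fset.add(m.sender)      # fork-I: conflicting or stale prepare log
    return fset

# Scenario loosely modelled on Figure 11b, with hypothetical sequence numbers:
# views 0, 1, 2 stand for i, i+1, i+2, and s0 omits the prepare log of r3.
sg = {0: {"s0", "s1"}, 1: {"s0", "s2"}}
full = {1: (1, "r0"), 2: (1, "r1"), 3: (1, "r2"), 4: (1, "r3")}
old = {1: (0, "r0"), 2: (0, "r1"), 3: (0, "r2")}
vc_set = [
    ViewChange("s0", {1: (1, "r0")}, {1: (1, "r0"), 2: (1, "r1"), 3: (1, "r2")}),
    ViewChange("s1", dict(old), dict(old)),
    ViewChange("s2", dict(full), dict(full)),
]
faulty = detect_faults(vc_set, sg, i=2)
print(faulty)  # {'s0'}
```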
Appendix C XPaxos correctness proof


In this appendix, we first prove the safety (consistency) and liveness (availability) properties of XPaxos. To prove safety (Section C.1), we show that when XPaxos is outside anarchy, consistency is guaranteed. In the liveness section (Section C.2), we show that XPaxos can make progress with at most t faulty replicas and any number of faulty clients, if the system is eventually synchronous (i.e., eventual synchrony).

Then, in Section C.3, we prove that the fault detection mechanism provides strong completeness and strong accuracy outside anarchy, with respect to non-crash faults which can violate consistency in anarchy.

We use the notation in Figure 14 to facilitate our proof of XPaxos. All predicates in Figure 14 are defined with respect to benign clients and replicas.


c, req, rep - Client c, request req from client and reply rep of req.
delivered(c, req, rep) - Client c delivers response rep for request req.
before(req, req') - Request req is executed prior to request req', i.e., req' is executed based on execution of req.
sg_i - the set of replicas in synchronous group i.
accepted(c, req, rep, i) - Client c receives t + 1 matching replies of req from every active replica in view i.
prefix(req, req', s_j) - Request req' is executed after execution of request req at replica s_j.
committed(req, i, sn, s_j) - Active replica s_j ∈ sg_i has received t + 1 matching PREPARE or COMMIT messages.
sg-committed(req, i, sn) - ∀ benign active replica s_j ∈ sg_i : committed(req, i, sn, s_j).
executed(req, i, sn, s_j) - Active replica s_j ∈ sg_i has executed request req at sequence number sn in its state.
sg-executed(req, i, sn) - ∀ benign active replica s_j ∈ sg_i : executed(req, i, sn, s_j).
prepared(req, i, sn, s_j) - Active replica s_j ∈ sg_i has received PREPARE message at sn for req.


Figure 14: XPaxos proof notation.


C.l Safety (Consistency) 

Theorem 1. (safety) If delivered(c, req, rep), delivered(c', req', rep'), and req ≠ req', then either before(req, req') or before(req', req).

To prove the safety property, we start from Lemma 2 which shows a useful relation between predicates delivered() and accepted().

Lemma 2. (view exists) delivered(c, req, rep) ⇔ ∃ view i : accepted(c, req, rep, i).

Proof: By common case protocol Algorithm 1 lines:{26-29} and Algorithm 2 lines:{22-25}, client c delivers a reply only upon receiving t + 1 matching REPLY messages from all active replicas in the same view. Conversely, upon receiving t + 1 matching REPLY messages from active replicas in the same view, client c delivers the reply. □


Lemma 3. (reply is correct) If accepted(c, req, rep, i), then rep is the reply of req executed by a correct replica.

Proof :

1. ∃s_j ∈ sg_i : s_j is correct.

Proof : By the assumption of at most t faulty replicas and |sg_i| = t + 1.

2. Client c expects matching replies from t + 1 active replicas in sg_i.

Proof : By common case protocol Algorithm 1 lines:{26-29} and Algorithm 2 lines:{22-25}.

3. Q.E.D.

Proof : By 1 and 2. □
By Lemma 2 and Lemma 3, we assume ∃ view i for req and ∃i' for req', then we instead prove :

Theorem 2. (safety) If accepted(c, req, rep, i) and accepted(c', req', rep', i'), then before(req, req') or before(req', req).

Now we introduce the sequence number.

Lemma 4. (sequence number exists) If accepted(c, req, rep, i), then ∃ sequence number sn : sg-executed(req, i, sn).

Proof :

1. Client c accepts rep in view i as reply of req upon:

(1) c receives REPLY messages with matching ts, rep, sn and i; and,

(2) REPLY messages are attested by t + 1 active replicas in sg_i.

Proof : By common case protocol Algorithm 1 lines:{26-29} and Algorithm 2 lines:{22-25}.

2. Benign active replica s_j ∈ sg_i sends REPLY message for req only upon : executed(req, i, sn, s_j).
Proof : By common case protocol Algorithm 1 lines:{21-25} and Algorithm 2 lines:{18-21}.

3. Q.E.D.

Proof : By 1 and 2. □

By Lemma 4, we assume ∃ sequence number sn for req and ∃sn' for req'. Then we instead prove:

Theorem 3. (safety) If sg-executed(req, i, sn), sg-executed(req', i', sn') and sn < sn', then ∀ benign active replica s_{j'} ∈ sg_{i'} : prefix(req, req', s_{j'}).

Towards the proof of Theorem 3, we first prove several lemmas below (from Lemma 5 to Lemma 11). Lemma 5 proves that if a request is executed by a benign active replica, then that request has been committed by the same replica.

Lemma 5. If executed(req, i, sn, s_j), then committed(req, i, sn, s_j).

Proof: By common case protocol Algorithm 1 lines:{10-21} and Algorithm 2 lines:{10-18}, every benign active replica first commits a request by receiving t + 1 matching PREPARE or COMMIT messages, then it executes the request based on committed order. □


Lemma 6. (committed() is unique) If committed(req, i, sn, s_j) and committed(req', i, sn, s_{j'}), then req = req'.

Proof : Proved by contradiction.

1. We assume ∃ requests req and req' : committed(req, i, sn, s_j), committed(req', i, sn, s_{j'}) and req ≠ req'.

Proof: Contradiction assumption.

2. ∃ correct active replica s_k ∈ sg_i : s_k has sent a PREPARE or COMMIT message for both req and req' at sn (i.e., s_k has executed common case protocol Algorithm 1 lines:{8-9} or Algorithm 1 lines:{15-17}, or Algorithm 2 lines:{7-9} or Algorithm 2 lines:{14-15}, for both req and req').
Proof : By |sg_i| = t + 1, ∃s_k : s_k is correct; then by 1, common case protocol Algorithm 1 lines:{18-20} or Algorithm 2 lines:{16-17}, and the definition of committed().

3. Q.E.D.

Proof : By 2 and 1. □

Lemma 7, which is proved by induction, lies at the heart of the XPaxos safety proof. By Lemma 7 we show that, if request req is committed at sn by every (benign) active replica in the same view, and if request req' is committed by any replica in a later view at sn, then req = req'.
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Lemma 7. (sg-committed() is durable) If sg-committed(req,i, sn), then^i' >i : if committed (req',i', sn, Sj/) 
then req = req'. 

Proof : 

1. We assume Mi" and Sjn : i < i" < i' and Sjn G •Sfi'i", if committed(re(jr", z", sn, Sj//) then 
req = reg". 

Proof : Inductive Hypothesis. 

2. V benign replica Sj' G sgi/ : Sj/ has been waiting for view-change messages from Vsfc G H 
within 2A time. 

Proof : By committed(reg^, i', sn, Sj^), s^' has generated prepare or commit message at sn; by 
view change protocol Algorithm 3 /mes:{23,26,28}, a benign active replica generates a prepare 
or COMMIT message in view i' only npon the replica has executed Algorithm 3 lines:{lQ} in 
view i'; then by Algorithm 3 /znes:{13-15}. 

3. 3sj> G sgi' : Sj/ is correct. 

Proof : By \sgi'\ = t + 1 and at most t fanlty replicas. 

4. During view change to I, Sji has collected view-change message m from a correct active replica 
Sj G sgi. 

Proof : By 2 and 3, view change protocol Algorithm 3 /mes:{13-15} have been execnted at SjC 
Sj' polls all replicas for view-change messages and waits for response from t + 1 replicas as well 
as the timer set to 2A to expire. Assume that sj/ has received view-change messages from 
r > 1 replicas in view i. The other t -|- 1 — r replicas in view i are either faulty or partitioned 
based on definitions. Among r replicas which have replied, at most t — {t + 1 — r) = r — 1 are 
faulty. Hence, at least one replica, say, Sj G sgi is correct and has replied with m. 

5. m contains t + 1 matching PREPARE or COMMIT messages for request req'' at sequence number sn, generated in view $i'' > i$.

Proof : By Algorithm 3 lines:{6-7}, benign replicas process messages in ascending view order, so that a commit log at sn generated in view i will not be replaced by any commit log generated in view $i''' < i$; then by 4 and sg-committed(req, i, sn).

6. In view i', $\forall s_{k'} \in sg_{i'}$: $s_{k'}$ can commit req'', or any req''' which is committed in view $i''' > i''$ at sn.

Proof : By Algorithm 3 lines:{19} and 5.

7. req'' = req''' = req.

Proof : By 4 and 5, req'' is committed in i'' and req''' is committed in i''', where $i''' > i'' > i$; then by 1.

8. req' = req.

Proof : By 6, 7 and committed(req', i', sn, $s_{j'}$). □
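
The counting argument in step 4 of the proof above can be checked by brute force for small t. The following Python sketch is illustrative only: the state names and the function `has_correct_responder` are hypothetical, and it assumes that (a) at most t of the t + 1 members of the synchronous group are faulty or partitioned, and (b) a correct, non-partitioned replica always replies within $2\Delta$, while a partitioned one does not.

```python
from itertools import product

# Hypothetical replica states, used for illustration only.
CORRECT = "correct"                # correct and synchronous: always replies in time
FAULTY_REPLIES = "faulty_replies"  # faulty replica that chooses to reply
FAULTY_SILENT = "faulty_silent"    # faulty replica that does not reply
PARTITIONED = "partitioned"        # correct, but not reachable within 2*Delta


def has_correct_responder(t: int) -> bool:
    """Check the bound of step 4 for every configuration of t + 1 replicas."""
    states = (CORRECT, FAULTY_REPLIES, FAULTY_SILENT, PARTITIONED)
    for group in product(states, repeat=t + 1):
        bad = sum(s != CORRECT for s in group)
        if bad > t:
            continue  # assumption (a) is violated, skip this configuration
        r = sum(s in (CORRECT, FAULTY_REPLIES) for s in group)  # replies received
        silent = (t + 1) - r                                    # t + 1 - r
        faulty_among_responders = group.count(FAULTY_REPLIES)
        # Bound from the proof: at most t - (t + 1 - r) = r - 1 responders are faulty,
        # so with r >= 1 at least one responder is correct.
        if r < 1 or faulty_among_responders > t - silent:
            return False
    return True
```

Under these assumptions the check succeeds for every configuration, which mirrors the conclusion that at least one replica that replied is correct.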

By Lemma 7 we can easily get Lemma 8. 

Lemma 8. If sg-committed(req,i, sn) and sg-committed(req',i', sn), then req = req'. 

Proof: By Lemma 7 and definition of sg-committed(). □ 


Lemma 9. If executed(req, i, sn, $s_j$), then $\forall sn' < sn$: $\exists req'$ s.t. committed(req', i, sn', $s_j$).

Proof: By the common case protocol, Algorithm 1 lines:{21-22} and Algorithm 2 lines:{18-19}, correct active replicas execute requests in the order defined by the committed sequence numbers; by executed(req, i, sn, $s_j$) and sn' < sn, executed(req', i, sn', $s_j$); and, by Lemma 5. □
Lemma 10. (executed() in order) If committed(req, i, sn, $s_j$), executed(req', i, sn', $s_j$) and sn < sn', then prefix(req, req', $s_j$).

Proof: By Lemma 9, $\exists req''$ s.t. committed(req'', i, sn, $s_j$); by Lemma 6, req'' = req; by the common case protocol, Algorithm 1 lines:{21-22} and Algorithm 2 lines:{18-19}, benign active replicas execute requests in the order defined by the committed sequence numbers sn and sn'; and, by sn < sn'. □


Lemma 11. If sg-committed(req, i, sn), sg-executed(req', i, sn') and sn < sn', then $\forall$ benign active replica $s_j$: prefix(req, req', $s_j$).

Proof: By Lemma 10. □ 

Now we can prove Theorem 3. 

Proof : 

1. sg-committed(req, i, sn) and sg-committed(req', i', sn').

Proof : By sg-executed(req, i, sn), sg-executed(req', i', sn') and Lemma 5.

When i < i' :

2. sg-committed(req, i', sn).

Proof : By sg-executed(req', i', sn'), Lemma 9 and sn < sn', $\exists req''$: sg-committed(req'', i', sn); then by Lemma 8, sg-committed(req, i, sn) and i < i', req'' = req.

3. $\forall$ benign active replica $s_{j'} \in sg_{i'}$: prefix(req, req', $s_{j'}$).

Proof : By sg-executed(req', i', sn'), 2, sn < sn' and Lemma 11.

When i = i' :

4. $\forall$ benign active replica $s_j \in sg_{i'}$: prefix(req, req', $s_j$).

Proof : By 1, sg-executed(req', i', sn'), i = i', sn < sn' and Lemma 11.

When i > i' :

5. $\exists req''$: sg-committed(req'', i', sn).

Proof : By Lemma 9, sg-executed(req', i', sn') and sn < sn'.

6. req'' = req.

Proof : By 5 and Lemma 8.

7. $\forall$ benign active replica $s_{j'} \in sg_{i'}$: prefix(req, req', $s_{j'}$).

Proof : By 5, 6, sg-executed(req', i', sn'), sn < sn' and Lemma 11.

8. Q.E.D. 

Proof : By 3, 4 and 7. □ 

C.2 Liveness (Availability) 

Before proving the liveness property, we first prove two lemmas (Lemmas 12 and 13).

Lemma 12. If a correct client c issues a request req in view i, then eventually, either (1) accepted(c, req, rep, i) or (2) XPaxos changes view to i + 1.

Proof :

1. We assume accepted(c, req, rep, i) is false, then we prove that eventually view i is changed to i + 1.

Proof : Equivalent. 


2. Client c sends req to every active replica upon the expiry of $timer_c$.

Proof : By 1, c is correct, and Algorithm 4 lines:{1-2}.

3. No replica in $sg_i$ sent a matching signed-reply message for req to client c.

Proof : By 1, c is correct and Algorithm 4 lines:{18-22}.

4. $\exists$ active replica $s_j \in sg_i$: $s_j$ is correct.

Proof : By assumption $|sg_i| = t + 1$ and at most t faulty replicas.

5. $s_j$ has not received t + 1 matching signed-reply messages for req.

Proof : By 3, 4 and Algorithm 4 lines:{18-22}.

Either,

6. $s_j$ starts $timer_{req}$.

Proof : By 2, 4 and Algorithm 4 lines:{3,6}.

7. $s_j$ suspects view i when $timer_{req}$ expires.

Proof : By 4, 5, 6 and Algorithm 4 lines:{8-10}.

or,

8. $s_j$ starts a timer during the view change to i.

Proof : By Algorithm 3 lines:{17>}. 

9. $s_j$ suspects view i when that timer expires.

Proof : By 2, 8 and Algorithm 3 lines:{34-35}. 

10. Q.E.D. 

Proof : By 1 and 7, 9. □ 

Lemma 13. If a correct client c issues a request req in view i, the system is synchronous for a sufficient time and $\forall$ active replica $s_j \in sg_i$: $s_j$ is correct, then eventually accepted(c, req, rep, i).

Proof :

1. All active replicas in $sg_i$ and c follow the protocol correctly.

Proof: c is correct and $\forall$ active replica $s_j \in sg_i$: $s_j$ is correct.

2. No timer expires.

Proof : By 1 and the system being synchronous.

3. No view change happens.

Proof : By 1 and Algorithm 3 lines:{1-7}, there is no faulty replica in $sg_i$, and no faulty passive replica in view i can suspect view i deliberately; and by 2, no correct replica in $sg_i$ suspects view i.

4. accepted(c, req, rep, i).

Proof : By 3 and Lemma 12. □

Theorem 4. (liveness) If a correct client c issues a request req, then eventually delivered(c, req, rep).

Proof : By contradiction.

1. We assume delivered(c, req, rep) is always false.

Proof: Contradiction assumption.

2. If the current view is i, then the view is eventually changed to i + 1.

Proof : By 1, Lemma 2 and Lemma 12.

3. The view change is executed infinitely many times.

Proof : By 1 and 2, Algorithm 4 lines:{11-15} and Algorithm 4 lines:{1-2}, correct client c always multicasts a SUSPECT message and req to every active replica in the new view.

4. Eventually the system is synchronous.

Proof: Eventual synchrony assumption.

5. $\exists$ view i': $\forall$ active replica $s_{j'} \in sg_{i'}$, $s_{j'}$ is correct.

Proof: The view change protocol rotates among the combinations of t + 1 active replicas out of the 2t + 1 replicas, among which there exists one synchronous group containing only correct active replicas.

6. accepted(c, req, rep, i').

Proof : By 3, 4, 5, Lemma 13 and c is correct.

7. Q.E.D.

Proof : By 1, 6, Lemma 2 and contradiction. □
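
Step 5 of the proof above relies on the fact that, among all groups of t + 1 replicas drawn from 2t + 1, at least one contains only correct replicas whenever at most t replicas are not correct. A minimal Python sketch of this enumeration follows. The function names are hypothetical, replicas are identified by the integers 0..2t, and the order produced by `itertools.combinations` is only a stand-in for the rotation order that XPaxos actually uses, which is defined elsewhere in the paper.

```python
from itertools import combinations


def synchronous_groups(t: int) -> list:
    """All candidate groups of t + 1 active replicas out of n = 2t + 1."""
    n = 2 * t + 1
    return list(combinations(range(n), t + 1))


def all_correct_groups(t: int, non_correct: set) -> list:
    """Groups that contain no replica from the (assumed) non-correct set."""
    return [g for g in synchronous_groups(t) if not set(g) & non_correct]
```

For example, with t = 1 and replica 0 not correct, `all_correct_groups(1, {0})` returns `[(1, 2)]`: the single group that a rotation over all three candidate groups eventually reaches.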

C.3 Fault detection (FD) 

In this section we prove that the fault detection mechanism satisfies strong completeness and strong accuracy outside anarchy.

First, in Definition 4 we define the type of messages that can possibly violate consistency in anarchy.

Definition 4. (non-crash faulty message) In view change to i, a view-change message m from replica $s_k$ is a non-crash faulty message if:

(i) m is sent to a correct active replica $s_j \in sg_i$;

(ii) $\exists$ view $i' < i$ and request req: sg-committed(req, i', sn);

(iii) at least one of the two properties below is satisfied:

(1) $s_k \in sg_{i'}$ and in m: PrepareLog[sn] is generated in view $i'' < i'$; or,

(2) in m: PrepareLog[sn].req $\neq$ req and PrepareLog[sn] is generated in view $i'' \geq i'$; and,

(iv) $\nexists i'''$ ($i''' > i''$ and $i''' > i'$) and $s_{k'''} \in sg_{i'''}$: committed(req, i''', sn, $s_{k'''}$).

Then we can prove: 

Lemma 14. If a view-change message m is not a non-crash faulty message, then m cannot violate 
consistency in anarchy. 

Proof : By contradiction.

1. If Definition 4 property (i) is not satisfied, then either m is sent to a non-crash faulty replica, and since our model makes no assumption on non-crash faulty replicas, m should not affect the state of any correct replica; or m is sent to a crashed or passive replica, which just stops processing or ignores m.

2. If Definition 4 property (ii) is not satisfied, then req has not been committed by some correct replica in $sg_{i'}$, hence accepted(c, req, rep, i') is not true.

3. If neither of Definition 4 properties (iii).(1) and (iii).(2) is satisfied, then either $s_k \in sg_{i'}$ and m contains the prepare log of req at sn generated in view $i'' \geq i'$, so by Algorithm 5 lines:{11-21}, m facilitates req to be committed in view i; or, if $s_k \notin sg_{i'}$, then either PrepareLog[sn].req is generated in $i'' < i'$, in which case, even if PrepareLog[sn].req $\neq$ req, based on Algorithm 5 lines:{13,14,18} PrepareLog[sn].req cannot be selected in view change to i if no (faulty) replica in i' sends an inconsistent message (e.g., a prepare log generated in a view lower than i' by $s_{k'} \in sg_{i'}$), hence we consider $s_k$ harmless in this case; or $i'' \geq i'$ and PrepareLog[sn].req = req, and the argument is the same as before.

4. If Definition 4 property (iv) is not satisfied, then $\exists i'''$ ($i''' > i''$ and $i''' > i'$) and $s_{k'''} \in sg_{i'''}$: committed(req, i''', sn, $s_{k'''}$). In this case, to modify req committed at sn, at least one of the (faulty) replicas in $sg_{i'''}$ has to send a non-crash faulty message; otherwise, based on Algorithm 5 lines:{11-21}, any non-crash faulty message generated in i'' will be ignored.

□

Finally, we prove the fault detection properties: strong completeness and strong accuracy. Roughly speaking, (strong completeness) if a message is a non-crash faulty message, then the sender will eventually be detected; conversely, (strong accuracy) if a replica is correct, then it will never be detected.

Theorem 5. (strong completeness) If a replica $s_k$ fails arbitrarily outside anarchy, in a way that would cause inconsistency in anarchy, then XPaxos FD detects $s_k$ as faulty (outside anarchy).

Proof :

1. By Lemma 14, it is equivalent to prove: in view change to i, if m is a non-crash faulty message from replica $s_k$, then a correct active replica $s_j \in sg_i$ detects the fault of $s_k$.

2. By Definition 4 property (ii), every correct replica $s_{k'} \in sg_{i'}$ has a commit log of req at sn generated in a view equal to or higher than i'. Assume that the highest view in which a commit log of req is generated is $i_0$ ($i' \leq i_0 < i$).

Proof : By Lemma 7.

If in 2, $i_0 = i'$:

3. Correct active replica $s_j \in sg_i$ should receive m', which contains a commit log of req generated in view i', from a correct active replica $s_{k'} \in sg_{i'}$.

Proof : By outside anarchy, 2, Definition 4 and Lemma 7.

4. If m satisfies Definition 4 property (iii).(1), then $s_j$ detects the fault of $s_k$.

Proof : By Definition 4 property (iii).(1), the prepare log of req is not included in m; then by 3 and Algorithm 6 lines:{3}, the fault is detected.

5. If m satisfies Definition 4 property (iii).(2), then $s_j$ detects the fault of $s_k$.

Proof : By Definition 4 property (iii).(2), the prepare log at sequence number sn is generated in view $i'' < i'$; then by 3 and Algorithm 6 lines:{6} the fault of $s_k$ is detected.

6. If m satisfies Definition 4 property (iii).(3), then $s_j$ detects the fault of $s_k$.

Proof : If in Definition 4 property (iii).(3) $i'' = i'$, then by 3 and Algorithm 6 lines:{6} the fault of $s_k$ is detected; otherwise, if $i'' > i'$, then based on Lemma 7 req must be retrieved by every correct active replica in view i''; hence by outside anarchy and Algorithm 6 lines:{9-14} the fault of $s_k$ is detected.

If in 2, $i_0 > i'$:

7. Every replica (correct or faulty) in view $i_0$ has retrieved and prepared req in a view equal to or higher than $i_0$.

Proof : By 2, $i''' > i'$ and Algorithm 3 lines:{26,28}.

8. In order to modify the request committed at sn (i.e., req), at least one of the (faulty) replicas, say $s_{k'''}$ (in $sg_{i_0}$ or not), has to send an inconsistent prepare log generated in view $i_1 > i_0$. Hence, $s_k$ in this case is harmless.

Proof : By 7, n = 2t + 1, $i_0 > i'$ and Algorithm 5 lines:{19}.

9. Correct active replica $s_j \in sg_i$ should receive m', which contains a commit log of req generated in view $i_2$ ($i' \leq i_2 \leq i_1 < i$), from a correct active replica $s_{k'} \in sg_{i'}$.

Proof : By outside anarchy, 2, Definition 4 and Lemma 7.

10. If $i_2 < i_1$, then the fault of $s_{k'''}$ is detected by Algorithm 6 lines:{9-16}, similarly to the discussion in 6; if $i_2 = i_1$, then the fault of $s_{k'''}$ is detected by Algorithm 6 lines:{3,6}, similarly to the discussion in 4 or 5.

11. Q.E.D.

Proof : By 3, 4, 5, 6 and 10.

□


Theorem 6. (strong accuracy) If a replica $s_k$ is benign (i.e., behaves faithfully), then XPaxos FD will never detect $s_k$ as faulty.

Proof :

1. It is equivalent to prove: in view change to i, if $s_k$ is benign and $s_k$ sends a view-change message m to all active replicas in view i, then no active replica in $sg_i$ can detect $s_k$ as faulty.

Proof : Equivalent.

$\forall$ request req, view $i' < i$ and replica $s_{j'}$ s.t. $s_k, s_{j'} \in sg_{i'}$ and committed(req, i', sn, $s_{j'}$):

2. m contains a prepare log of req' at sn generated in view $i'' \geq i'$.

Proof : By the common case protocol, Algorithm 1 lines:{9,17} and Algorithm 2 lines:{9,15}, and the view change Algorithm 5 lines:{1}, $s_k$ sends a prepare log at sequence number sn once $s_k$ has prepared a request at sn; by Algorithm 3 lines:{6-7}, correct replicas process messages in ascending view order, hence $i'' \geq i'$.

3. $s_k$ will not be detected by Algorithm 6 lines:{3} due to committed(req, i', sn, $s_{j'}$).

Proof : By 2 and Algorithm 6 lines:{3}.

4. No other request req'' $\neq$ req' is committed by any replica at sequence number sn in view i'.

Proof : By $s_k$ is correct and Lemma 6.

5. $s_k$ will not be detected by Algorithm 6 lines:{6} due to committed(req, i', sn, $s_{j'}$).

Proof : By 4 and Algorithm 6 lines:{6}.

6. $s_k$ will not be detected by Algorithm 6 lines:{9} due to committed(req, i', sn, $s_{j'}$).

Proof : Since $s_k$ is correct, $s_k$ did not generate or accept any incorrect prepare log during the view change to view i''; by Algorithm 6 lines:{9}, Algorithm 5 lines:{3-7} and Lemma 7, no conflicting vcSet and finalProof[i''] exist in view i'' at any active replica.

7. Q.E.D.

Proof : By 3, 5 and 6.

□


We can easily prove that if a fault is detected by any correct replica, then the fault is eventually detected by every correct replica.

Lemma 15. In view change to i, if a correct active replica $s_j \in sg_i$ detects the fault of $s_k$, then eventually every correct replica detects the fault of $s_k$.

Proof: By Algorithm 6 lines:{6-7}. □
Appendix D Reliability analysis (examples)

In Tables 5 and 6 we show the nines of consistency of each model when t = 1 and t = 2 for some practical values of $9_{benign}$, $9_{synchrony}$ and $9_{correct}$. In Tables 7 and 8 we show the nines of availability of each model when t = 1 and t = 2 for some practical values of $9_{available}$ and $9_{benign}$.
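
As a rough cross-check of the CFT and BFT columns of Tables 5 and 6, the following Python sketch computes nines of consistency with exact rational arithmetic. It is a sketch under stated assumptions, not the paper's definition: it assumes that the nines of a probability p is $\lfloor -\log_{10}(1 - p) \rfloor$, that machine faults are independent, that CFT (n = 2t + 1) stays consistent only if all replicas are benign, and that BFT (n = 3t + 1) stays consistent if at most t replicas are non-benign. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def nines(p: Fraction) -> int:
    """Largest k with 1 - p <= 10**-k, i.e. floor(-log10(1 - p)) (assumed definition)."""
    q = 1 - p
    if q <= 0:
        raise ValueError("p must be strictly less than 1")
    k = 0
    while q * 10 ** (k + 1) <= 1:
        k += 1
    return k


def prob_from_nines(k: int) -> Fraction:
    """Probability with exactly k nines, e.g. k = 3 gives 0.999."""
    return 1 - Fraction(1, 10 ** k)


def nines_of_consistency_cft(nines_benign: int, t: int) -> int:
    """Assumption: CFT is consistent only if all n = 2t + 1 replicas are benign."""
    n = 2 * t + 1
    return nines(prob_from_nines(nines_benign) ** n)


def nines_of_consistency_bft(nines_benign: int, t: int) -> int:
    """Assumption: BFT is consistent if at most t of n = 3t + 1 replicas are non-benign."""
    n = 3 * t + 1
    p = prob_from_nines(nines_benign)
    consistent = sum(comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(t + 1))
    return nines(consistent)
```

Under these assumptions, `nines_of_consistency_cft(3, 1)` evaluates to 2 and `nines_of_consistency_bft(3, 1)` to 5, matching the first row of Table 5. The XPaxos entries additionally depend on $9_{synchrony}$ and $9_{correct}$ and are not covered by this sketch.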



                                           9ofC(XPaxos_{t=1}) for 9synchrony =
9benign   9ofC(CFT_{t=1})   9correct       2     3     4     5     6     9ofC(BFT_{t=1})
   3             2              2          3     4     4     4     4            5
   4             3              2          4     5     5     5     5            7
                                3          5     5     6     6     6
   5             4              2          5     6     6     6     6            9
                                3          6     6     7     7     7
                                4          6     7     7     8     8
   6             5              2          6     7     7     7     7           11
                                3          7     7     8     8     8
                                4          7     8     8     9     9
                                5          7     8     9     9    10
   7             6              2          7     8     8     8     8           13
                                3          8     8     9     9     9
                                4          8     9     9    10    10
                                5          8     9    10    10    11
                                6          8     9    10    11    11
   8             7              2          8     9     9     9     9           15
                                3          9     9    10    10    10
                                4          9    10    10    11    11
                                5          9    10    11    11    12
                                6          9    10    11    12    12
                                7          9    10    11    12    13

Table 5: $9ofC(CFT_{t=1})$, $9ofC(XPaxos_{t=1})$ and $9ofC(BFT_{t=1})$ values when $3 \leq 9_{benign} \leq 8$, $2 \leq 9_{synchrony} \leq 6$ and $2 \leq 9_{correct} < 9_{benign}$.



                                           9ofC(XPaxos_{t=2}) for 9synchrony =
9benign   9ofC(CFT_{t=2})   9correct       2     3     4     5     6     9ofC(BFT_{t=2})
   3             2              2          4     5     5     5     5            7
   4             3              2          5     6     6     6     6           10
                                3          6     7     8     8     8
   5             4              2          6     7     7     7     7           13
                                3          7     8     9     9     9
                                4          7     9    10    11    11
   6             5              2          7     8     8     8     8           16
                                3          8     9    10    10    10
                                4          8    10    11    12    12
                                5          8    10    12    13    14
   7             6              2          8     9     9     9     9           19
                                3          9    10    11    11    11
                                4          9    11    12    13    13
                                5          9    11    13    14    15
                                6          9    11    13    15    16
   8             7              2          9    10    10    10    10           22
                                3         10    11    12    12    12
                                4         10    12    13    14    14
                                5         10    12    13    15    16
                                6         10    12    14    16    17
                                7         10    12    14    16    18

Table 6: $9ofC(CFT_{t=2})$, $9ofC(XPaxos_{t=2})$ and $9ofC(BFT_{t=2})$ values when $3 \leq 9_{benign} \leq 8$, $2 \leq 9_{synchrony} \leq 6$ and $2 \leq 9_{correct} < 9_{benign}$.
9ofA(CFT_{t=1})


9available

9benign

9ofA(BFT_{t=1})

9ofA(XPaxos_{t=1})

3 

1 

5 

6 

7 

8 

2 

2 

3 

3 

3 

3 

3 

3 

3 

3 


3 

4 

5 

5 

5 

5 

5 

i 



4 

5 

6 

7 

7 

7 

5 




5 

6 

7 

9 

9 

6 





6 

7 

11 

11 


Table 7: $9ofA(CFT_{t=1})$, $9ofA(BFT_{t=1})$ and $9ofA(XPaxos_{t=1})$ values when $2 \leq 9_{available} \leq 6$ and $9_{available} < 9_{benign} \leq 8$.



9ofA(CFT_{t=2})


9available

9benign

9ofA(BFT_{t=2})

9ofA(XPaxos_{t=2})

3 

4 

5 

6 

7 

8 

2 

2 

3 

4 

4 

4 

5 

4 

5 

3 


3 

4 

5 

6 

7 

7 

8 

4 



4 

5 

6 

7 

10 

11 

5 




5 

6 

7 

13 

14 

6 





6 

7 

16 

17 


Table 8: $9ofA(CFT_{t=2})$, $9ofA(BFT_{t=2})$ and $9ofA(XPaxos_{t=2})$ values when $2 \leq 9_{available} \leq 6$ and $9_{available} < 9_{benign} \leq 8$.






